Predicting Shopping Behavior with Mixture of RNNs

Arthur Toth, Rakuten Institute of Technology, Boston, Massachusetts 02110, [email protected]
Louis Tan, Rakuten Institute of Technology, Boston, Massachusetts 02110, [email protected]
Giuseppe Di Fabbrizio, Rakuten Institute of Technology, Boston, Massachusetts 02110, [email protected]
Ankur Datta, Rakuten Institute of Technology, Boston, Massachusetts 02110, [email protected]

ABSTRACT
We compare two machine learning approaches for early prediction of shoppers' behaviors, leveraging features from clickstream data generated during live shopping sessions. Our baseline is a mixture of Markov models used to predict three outcomes: purchase, abandoned shopping cart, and browsing-only. We then experiment with a mixture of Recurrent Neural Networks. When sequences are truncated to 75% of their length, a relatively small feature set predicts purchase with an F-measure of 0.80 and browsing-only with an F-measure of 0.98. We also investigate an entropy-based decision procedure.

CCS CONCEPTS
• Computing methodologies → Neural networks; • Applied computing → Online shopping;

KEYWORDS
mining/modeling search activity, click models, behavioral analysis

ACM Reference format:
Arthur Toth, Louis Tan, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Predicting Shopping Behavior with Mixture of RNNs. In Proceedings of SIGIR 2017 eCom, Tokyo, Japan, August 2017, 5 pages.

1 INTRODUCTION
Recent e-commerce forecast analysis estimates that more than 1.77 billion users will shop online by the end of 2017 [11]. Although this is impressive growth, conversion rates for online shoppers are substantially lower than rates for traditional brick-and-mortar stores. Consumers shopping on e-commerce web sites are influenced by numerous factors and may decide to stop the current session, leaving products in their shopping carts. Once a user has interacted with a shopping cart, abandonment rates range between 25% and 88%, significantly reducing merchants' selling opportunities [15].
Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

Several potential purchase inhibitors have been analyzed in the online shopping literature [5, 8]. The main causes include concerns about costs, perceived decision difficulty, and selection conflicts due to a large number of similar choices. Early detection of shopping abandonment may allow mitigation of purchase inhibitors by enabling injection of new content into live web browsing sessions. For instance, it could trigger a discount offer or change the product search strategy to retrieve more diverse options and simplify the consumer's decision process.

This paper considers real web interactions from a US e-commerce subsidiary of Rakuten to predict three possible outcomes: purchase, abandoned shopping cart, and browsing-only. To detect outcomes early, we consider clues left behind by shoppers that are encoded into clickstream data and logged during each session. The clickstream data is used to experiment with mixtures of high-order Markov Chain Models (MCMs) and mixtures of Recurrent Neural Networks (RNNs) that use the Long Short-Term Memory (LSTM) architecture. We compare and contrast each model on sequences truncated at different lengths and report precision, recall, and F-measures. We also show F-measures from using an entropy-based decision procedure that is usable in a live system.

2 RELATED WORK
We treat predicting user behavior from clickstream data as sequence classification, which is broadly surveyed by Xing et al. [14], who divide it into feature-based, sequence distance-based, and model-based methods. Previous feature-based work on clickstream classification includes the random forest used by Awalkar et al.
[2], the deep belief networks and stacked denoising auto-encoders used by Vieira [12], and the recurrent neural networks used by Wu et al. [13]. Previous distance-based work includes the large-margin nearest neighbor approach of Pai et al. [10]. Previous model-based work by Bertsimas et al. [4] used a mixture of Markov chains.

Our baseline approach is based on the latter work, whereas our new approach uses a mixture of RNNs. Although Wu et al. [13] used RNNs, their approach is not applicable to our scenario, since its bi-directional RNN uses entire clickstream sequences, whereas our goal is to classify incomplete sequences. Also, their model is not a mixture. Finally, our approach differs from most others in its use of a ternary classification scheme: we classify clickstreams as purchase, abandon, or browsing-only sessions instead of just purchase and non-purchase.



Table 1: Clickstream data distribution

Clickstream outcome   Count (%)         Average Length m_i   Median Length
ABANDON                29,371 (14.7%)    8.2                  5
BROWSING-ONLY         128,450 (64.6%)   16.6                  6
PURCHASE               41,115 (20.7%)    8.4                  5
Total                 198,936 (100%)

3 CLICKSTREAM DATA
We consider clickstream data collected over two weeks and consisting of n_0 = 1,560,830 sessions. A session S_i, i = 1, 2, ..., n_0, is a chronological sequence of recorded page views, or "clicks". Let V_{i,j} be the j-th page view of session i, so that S_i = (V_{i,1}, V_{i,2}, ..., V_{i,m_i}), where m_i is defined as the length of session i. We exclude session i from our experiments if m_i < 4, after which n = 198,936 sessions remain. Sessions with purchases are truncated before the first purchase confirmation. Table 1 summarizes the n sessions.
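The filtering and truncation just described can be sketched as follows. This is a hypothetical illustration: the page-type names and record layout are assumptions, not taken from the paper.

```python
# Hypothetical sketch of the session-preparation step described above; the
# page-type labels (e.g. "purchase_confirmation") are assumptions.

def prepare_sessions(sessions, min_length=4):
    """Truncate purchase sessions just before the first purchase
    confirmation, then drop sessions with fewer than min_length views."""
    prepared = []
    for views in sessions:  # each session: chronological list of page types
        if "purchase_confirmation" in views:
            views = views[:views.index("purchase_confirmation")]
        if len(views) >= min_length:
            prepared.append(views)
    return prepared

raw = [
    ["search", "product", "product", "cart", "purchase_confirmation"],  # kept, truncated
    ["search", "product"],                                              # dropped: too short
    ["top", "search", "product", "product", "cart"],                    # kept as-is
]
print(len(prepare_sessions(raw)))  # 2
```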

Our experiments use both the page type and dwell time of V_{i,j}. The page type of V_{i,j}, denoted P_{i,j}, belongs to one of eight categories, including search pages, product view pages, login pages, etc. The dwell time D_{i,j} of V_{i,j} is the amount of time the user spends viewing the page and is not available until the (j+1)-th page view. After the j-th page view, the clickstream data gathered for session i is given by S_{i|j} = ((P_{i,1}, D_{i,0}), (P_{i,2}, D_{i,1}), ..., (P_{i,j}, D_{i,j-1})), where D_{i,0} is undefined, i.e., D_{i,0} = ∅. To reduce sparsity, dwell times were placed in 8 bins evenly spaced by percentiles.
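Percentile-spaced binning of this kind can be sketched with NumPy; the synthetic exponential dwell times below are an assumption for illustration only.

```python
import numpy as np

# Dwell times are typically heavy-tailed, so bins evenly spaced by
# percentile yield roughly equal occupancy; synthetic data for illustration.
rng = np.random.default_rng(0)
dwell = rng.exponential(scale=30.0, size=10_000)  # seconds (synthetic)

# Bin edges at the 12.5th, 25th, ..., 87.5th percentiles -> 8 bins.
edges = np.percentile(dwell, np.arange(1, 8) * 12.5)
bins = np.digitize(dwell, edges)  # integer bin id in 0..7 per dwell time

counts = np.bincount(bins, minlength=8)
print(counts)  # roughly equal counts per bin by construction
```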

4 MODELING APPROACHES
Our goal is to classify customer behavior into final decision categories. In particular, clickstream sequences receive one of the following labels: PURCHASE if the sequence leads to an item purchase, ABANDON if an item was left in the shopping cart but there was no purchase, and BROWSING-ONLY when the shopping cart was not used. The final two categories can be combined to investigate PURCHASE vs. NON_PURCHASE behavior. In preliminary studies, the ABANDON sequences were much more similar to the PURCHASE sequences than to the BROWSING-ONLY sequences, so having three categories helped account for some of the confusability of the data. Our eventual goal of applying our models in a live system adds a constraint: the classifier must work for incomplete sequences, without using data from the "future".

4.1 Mixture of High-Order Markov Chains
Bertsimas et al. [4] modeled a similar problem with a mixture of Markov chains, using maximum a posteriori estimation to predict sequence labels. In this approach, the training data is partitioned by class, and a separate Markov chain model is trained for each one. The resultant models can estimate class likelihoods from sequences. Let C_i be a random variable representing the outcome of session S_i, where C_i ∈ Ω and Ω is the set of possible clickstream outcomes. The models are used to estimate the likelihoods P(S_i | C_i = ω) for all ω ∈ Ω. Using Bayes' theorem and the law of total probability, the class posteriors for each of the three classes can be estimated by the equation

    P(C_i = ω | S_i) = P(S_i | C_i = ω) P(C_i = ω) / Σ_{ω′∈Ω} P(S_i | C_i = ω′) P(C_i = ω′)

with the prior P(C_i = ω) estimated from counts.

This model fits the problem's constraints because likelihoods can be produced from subsequences without using "future" clicks. Although each chain is trained only on click data, separating the data by class implicitly conditions the chains on class.
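The posterior computation above can be sketched in a few lines, working in log space for numerical stability. The likelihood and prior values below are made up for illustration (the priors echo the class proportions in Table 1).

```python
import math

# Hedged sketch of the mixture decision rule: each class-conditional model
# supplies a log-likelihood log P(S | C=w); priors come from class counts.
def class_posteriors(log_likelihoods, priors):
    """log_likelihoods, priors: dicts keyed by class label."""
    log_joint = {w: ll + math.log(priors[w]) for w, ll in log_likelihoods.items()}
    m = max(log_joint.values())              # log-sum-exp normalization
    unnorm = {w: math.exp(v - m) for w, v in log_joint.items()}
    z = sum(unnorm.values())
    return {w: u / z for w, u in unnorm.items()}

post = class_posteriors(
    {"PURCHASE": -41.2, "ABANDON": -43.0, "BROWSING-ONLY": -40.5},  # made-up values
    {"PURCHASE": 0.207, "ABANDON": 0.147, "BROWSING-ONLY": 0.646},
)
print(max(post, key=post.get))  # MAP decision
```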

Taking inspiration from the Automatic Speech Recognition (ASR) community and similarities to language modeling, we adapted some of its more recent techniques to our problem. In preliminary experiments, 5-grams performed better than shorter chains, so we used them. Longer chains cause greater sparsity, so we addressed this with Kneser-Ney smoothing, which performed best in a study of language modeling techniques [6]. We used the MITLM toolkit to train the Markov chains [7].
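To make the n-gram chain structure concrete, here is a deliberately crude stand-in: a per-class 5-gram scorer with add-one smoothing. The paper itself used Kneser-Ney smoothing via MITLM; this simplified smoothing and the tiny vocabulary are illustration-only assumptions.

```python
from collections import defaultdict
import math

# Illustration only: add-one smoothing instead of the Kneser-Ney smoothing
# actually used in the paper; one such scorer would be trained per class.
class NGramScorer:
    def __init__(self, n=5, vocab_size=64):
        self.n, self.v = n, vocab_size
        self.full = defaultdict(int)   # counts of n-grams
        self.ctx = defaultdict(int)    # counts of (n-1)-gram contexts

    def train(self, sequence):
        padded = ["<s>"] * (self.n - 1) + list(sequence)
        for j in range(self.n - 1, len(padded)):
            gram = tuple(padded[j - self.n + 1:j + 1])
            self.full[gram] += 1
            self.ctx[gram[:-1]] += 1

    def log_likelihood(self, sequence):
        padded = ["<s>"] * (self.n - 1) + list(sequence)
        ll = 0.0
        for j in range(self.n - 1, len(padded)):
            gram = tuple(padded[j - self.n + 1:j + 1])
            ll += math.log((self.full[gram] + 1) / (self.ctx[gram[:-1]] + self.v))
        return ll

m = NGramScorer()
m.train(["search", "product", "cart", "product", "cart"])
print(m.log_likelihood(["search", "product", "cart"]))
```

A sequence resembling the training data scores higher than an unseen one, which is exactly the signal the per-class mixture exploits.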

4.2 Recurrent Neural Networks
Taking further inspiration from the ASR community, we replaced the Markov chains in our mixtures with RNNs. Earlier language modeling work used feed-forward artificial neural networks [3], but RNNs have recently performed better, both in direct likelihood metrics and in overall ASR task metrics [9, 16]. Clickstream data differs from ASR text, and our mixture model differed from the typical ASR approach, so it was unclear whether RNNs would help in our scenario.

Figure 1: Simple RNN to Predict Following Token

Earlier RNN-based language models used the "simple recurrent neural network" architecture [9]. The underlying idea, depicted in Figure 1, is that an input sequence, represented by X_0, ..., X_{t-1}, is connected to a recurrent layer, represented by A_0, ..., A_t, which is also connected to itself at the next point in time. The recurrent layer is also connected to an output layer. In our models, each RNN tries to predict the next input X_{n+1} after each input X_n.

"Simple" RNN-based language models for ASR were outperformed by RNNs using the Long Short-Term Memory (LSTM) configuration and "drop-out" [16]. LSTM addressed the "vanishing gradient" and "exploding gradient" problems. "Drop-out" addressed overfitting by probabilistically ignoring the contributions of non-recurrent nodes during training.

In LSTM RNNs, some nodes function as "memory cells". Some connections retrieve information from them, and others cause them to forget. The LSTM equations are [16]:

    LSTM: h_t^{l-1}, h_{t-1}^l, c_{t-1}^l → h_t^l, c_t^l

    (i, f, o, g) = (sigm, sigm, sigm, tanh)(T_{2n,4n} [h_t^{l-1}; h_{t-1}^l])

    c_t^l = f ⊙ c_{t-1}^l + i ⊙ g

    h_t^l = o ⊙ tanh(c_t^l)

where h and c represent hidden states and memory cells, subscripts refer to time, superscripts refer to layers, T_{n,m} is an affine transform from R^n to R^m, ⊙ refers to element-wise multiplication, and sigm and tanh are applied element-wise.
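The equations above can be sketched as a single NumPy step; this is an assumed minimal forward pass of one LSTM cell (no biases per gate beyond a shared vector, no drop-out), not the paper's TensorFlow implementation.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(h_below, h_prev, c_prev, W, b):
    """One LSTM step following the equations above: an affine map
    T_{2n,4n} of (h_t^{l-1}, h_{t-1}^l) produces gates i, f, o and
    candidate g. W has shape (2n, 4n), b has shape (4n,)."""
    n = h_prev.shape[0]
    z = np.concatenate([h_below, h_prev]) @ W + b       # affine transform
    i, f, o = (sigm(z[k*n:(k+1)*n]) for k in range(3))  # input/forget/output gates
    g = np.tanh(z[3*n:])                                # candidate update
    c = f * c_prev + i * g                              # memory cell update
    h = o * np.tanh(c)                                  # new hidden state
    return h, c

n = 4
rng = np.random.default_rng(1)
W, b = rng.normal(size=(2 * n, 4 * n)) * 0.1, np.zeros(4 * n)
h, c = lstm_cell(rng.normal(size=n), np.zeros(n), np.zeros(n), W, b)
print(h.shape, c.shape)  # (4,) (4,)
```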

Since LSTM RNNs with drop-out worked much better than Markov chains for ASR, we replaced the Markov chains with them in our clickstream mixture models. Additional work was necessary, since our scenario did not exactly match language modeling. During training, input tokens were still used to predict the following tokens. During testing, however, our goal was the sequence probabilities. These were calculated from the token probabilities present in the intermediate softmax layers of each LSTM model. Due to the network architecture, "future" events were not used for this. We used TensorFlow to implement our LSTM RNNs [1].
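Deriving a sequence probability from per-step softmax outputs is just the chain rule: sum the log-probabilities the model assigned to each observed next token. A minimal sketch, with a toy three-token vocabulary as an assumption:

```python
import math

# Chain rule: log P(sequence) = sum over steps of log P(next token | prefix).
# No "future" events are needed, matching the live-system constraint.
def sequence_log_prob(softmax_steps, tokens):
    """softmax_steps[j]: predicted distribution over the vocabulary after
    seeing tokens[:j+1]; tokens[j+1] is the observed next token."""
    return sum(math.log(softmax_steps[j][tokens[j + 1]])
               for j in range(len(tokens) - 1))

# toy vocabulary of 3 page types: 0=search, 1=product, 2=cart (assumed)
steps = [{0: 0.2, 1: 0.7, 2: 0.1},   # softmax after observing token 0
         {0: 0.1, 1: 0.3, 2: 0.6}]   # softmax after observing tokens 0, 1
print(sequence_log_prob(steps, [0, 1, 2]))  # log(0.7) + log(0.6)
```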

5 EXPERIMENTS
We experimented on the dataset described in Section 3 and summarized in Table 1. It was partitioned into an 80% training / 20% testing split, using stratified sampling to retain the class proportions.

All RNNs were trained for 10 epochs using batches of 20 sequences. We tested 16 combinations of the remaining parameters: the number of recurrent layers was 1 or 2, the keep probability was 1 or 0.5, and the hidden state size was 10, 20, 40, or 80. For a particular mixture model, all the RNNs used the same parameter values.
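The 16 configurations are the Cartesian product of the three parameter lists, which can be enumerated directly:

```python
from itertools import product

# 2 layer counts x 2 keep probabilities x 4 hidden-state sizes = 16 configs.
grid = list(product([1, 2], [1.0, 0.5], [10, 20, 40, 80]))
print(len(grid))  # 16
```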

5.1 Results
For each model, we evaluated the prediction performance by truncating the page view sequences at different lengths. Table 2 shows the results for the mixture of Markov models and for one of the mixture-of-RNN trials. Although we tested 16 different RNN parameter combinations, the results were so similar that we report only one of them.

Table 2 reports precision, recall, and F1-measure for each specific sequence outcome when considering 25%, 50%, 75%, and 100% of the total length of the sequence. For instance, when splitting at 50%, the Markov chain model can predict a PURCHASE with 0.42 precision and 0.11 recall, resulting in an overall F1-measure of 0.17. For the same conditions, the RNN-based model reaches a precision of 0.82 with 0.71 recall and an F1-measure of 0.76. We also report the accuracy when randomly selecting a class based on the prior distribution of the clickstream corpus. The RNN mixture components substantially outperform the Markov chain components. This is particularly evident from Figure 2, which shows the F1-measure by sequence length for the mixtures of MCMs (dotted lines) and of RNNs (solid lines). Both models monotonically increase performance as they observe more data, from 25% splits to 100%, but the mixture of RNNs has an immediate edge even with the short 25% sequences. The MCMs present similar F1-measures for the majority class (i.e., BROWSING-ONLY) but are penalized by the lack of data for the less represented sequences (i.e., 14.7% for ABANDON and 20.7% for PURCHASE). RNNs instead generalize better, due to the memory component that can model long-distance dependencies.

Figure 2: F1-measure by sequence length for mixtures of Markov Chain Models (MCM) and RNNs for each label (ABN=abandoned, BRO=browsing-only, PRC=purchase)

Figure 3: Log-probability trajectories from the MCM mixture, progressing along page view sequences for each class (BROWSING-ONLY, ABANDON, PURCHASE)

Similarly to Bertsimas et al. [4], in Figure 3 we plot 100 log-probability trajectories, with lengths from 2 through 20 page views, estimated by the MCM mixture along the page view sequences for each class. This plot demonstrates how probabilities evolve during interactions and how confident each model is compared to the others.

The legend in Figure 3 corresponds to the true final-state labels of the test examples: dashed red lines for ABANDON sequences, dashed-and-dotted green lines for BROWSING-ONLY sequences, and solid blue lines for PURCHASE sequences. Ideally, the model would score the PURCHASE sequences (solid blue lines) high and the other sequences low, and the earlier the distinction can be made, the better. Looking at this figure, there does appear to be some level of discrimination between the categories. In general, the BROWSING-ONLY sequences seem more separable from the PURCHASE sequences than the ABANDON sequences are, as expected.

Table 2: Precision (P), Recall (R), and F1-measure (F1) for MCM and RNN mixtures by sequence length

MCM mixture:
Final State      25% (P/R/F1)          50% (P/R/F1)          75% (P/R/F1)          100% (P/R/F1)         Random
ABANDON          0.44 / 0.02 / 0.04    0.45 / 0.09 / 0.14    0.72 / 0.52 / 0.61    0.70 / 0.64 / 0.67    0.21
BROWSING-ONLY    0.66 / 0.99 / 0.79    0.69 / 0.98 / 0.81    0.89 / 0.97 / 0.92    0.99 / 0.95 / 0.97    0.64
PURCHASE         0.37 / 0.02 / 0.04    0.42 / 0.11 / 0.17    0.72 / 0.67 / 0.69    0.71 / 0.86 / 0.78    0.15

RNN mixture:
Final State      25% (P/R/F1)          50% (P/R/F1)          75% (P/R/F1)          100% (P/R/F1)         Random
ABANDON          0.75 / 0.39 / 0.52    0.73 / 0.62 / 0.67    0.74 / 0.72 / 0.73    1.00 / 0.84 / 0.91    0.21
BROWSING-ONLY    0.84 / 0.99 / 0.91    0.92 / 0.99 / 0.96    0.97 / 0.99 / 0.98    1.00 / 0.98 / 0.99    0.64
PURCHASE         0.80 / 0.64 / 0.71    0.82 / 0.71 / 0.76    0.82 / 0.78 / 0.80    0.86 / 1.00 / 0.92    0.15

Although Table 2 can be used for comparison with other work [10], it depends on sequence lengths. A useful live system must predict final actions before sequences are complete and needs a decision process for accepting the predicted label. We experimented with taking the highest-scoring label once the entropy of the class distribution fell below a threshold. Figure 4 shows the F1-measures at different thresholds, which are expressed as proportions of the maximum possible entropy. Since the entropy might never drop below the threshold, it is important to consider how many sequences receive predicted labels. When we calculated F1-measures, sequences without predictions were counted as misses. Figure 5 shows the number of sequences for which predictions were made at each entropy threshold. Again, the horizontal axis represents the proportion of the maximum possible entropy, and the vertical axis represents the number of sequences for which a decision can be made based on crossing the threshold. As an example, a threshold of 0.55 led to reasonable F1-measures while producing predictions for 99% of the sequences before they were complete. Choosing higher entropy thresholds allows decisions to be made for more sequences, but performance can suffer, since decisions can be made while the class probabilities are more uniform and confidence is lower. Choosing lower entropy thresholds forces the class probabilities to be more distinct, which leads to more confident decisions, but performance starts to suffer when fewer sequences receive decisions. In practice, the threshold would be tuned on held-out data.
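The entropy-based acceptance rule can be sketched as follows; the threshold value and posterior numbers are assumptions chosen only to exercise both branches.

```python
import math

# Sketch of the entropy-based decision rule: accept the top label the first
# time the posterior's entropy drops below a fraction of the maximum
# possible entropy (log2 of the number of classes).
def decide(posteriors, threshold=0.55):
    """posteriors: dict label -> probability. Returns a label or None."""
    h = -sum(p * math.log2(p) for p in posteriors.values() if p > 0)
    h_max = math.log2(len(posteriors))
    if h / h_max < threshold:
        return max(posteriors, key=posteriors.get)
    return None  # not confident yet; wait for more clicks

print(decide({"PURCHASE": 0.90, "ABANDON": 0.06, "BROWSING-ONLY": 0.04}))  # PURCHASE
print(decide({"PURCHASE": 0.40, "ABANDON": 0.35, "BROWSING-ONLY": 0.25}))  # None
```

Sequences whose entropy never crosses the threshold end with `None` and would be counted as misses when scoring, matching the evaluation described above.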

6 CONCLUSION AND FUTURE WORK
We presented two models for the real-time early prediction of shopping session outcomes on an e-commerce platform. We demonstrated that LSTM RNNs generalize better, and with less data, than the high-order Markov chain models used in previous work. Our approach, in addition to distinguishing browsing-only and cart-interaction sessions, can also accurately discriminate between cart-abandonment and purchase sessions. Future work will focus on features, single-RNN architectures, and decision strategies.

Figure 4: F1-measures at entropy thresholds (proportion of maximum entropy) for each class (Abandon, No_Cart_Interaction, Purchase)

Figure 5: Sequences Crossing Entropy Thresholds (horizontal axis: entropy threshold as a proportion of maximum entropy; vertical axis: number of sequences where entropy goes below the threshold)

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org. Software available from tensorflow.org.
[2] Aditya Awalkar, Ibrahim Ahmed, and Tejas Nevrekar. 2016. Prediction of User's Purchases using Clickstream Data. International Journal of Engineering Science and Computing (2016).
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (March 2003), 1137–1155.
[4] Dimitris Bertsimas, Adam J. Mersereau, and Nitin R. Patel. 2003. Dynamic Classification of Online Customers. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003. 107–118.
[5] Robert Florentin. 2016. Online shopping abandonment rate: a new perspective: the role of choice conflicts as a factor of online shopping abandonment. Master's thesis, University of Twente.
[6] Joshua T. Goodman. 2001. A Bit of Progress in Language Modeling. Comput. Speech Lang. 15, 4 (Oct. 2001), 403–434.
[7] Bo-June Paul Hsu and James R. Glass. 2008. Iterative language model estimation: efficient data structure & algorithms. In INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008. 841–844.
[8] Monika Kukar-Kinney and Angeline G. Close. 2010. The determinants of consumers' online shopping cart abandonment. Journal of the Academy of Marketing Science 38, 2 (2010), 240–250.
[9] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In The 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010).
[10] Deepak Pai, Abhijit Sharang, Meghanath Macha Yadagiri, and Shradha Agrawal. 2014. Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach. Springer International Publishing, 135–145.
[11] The Statistics Portal. 2014. Digital buyer penetration worldwide from 2014 to 2019. http://www.statista.com/statistics/261676/digital-buyer-penetration-worldwide (2014). Accessed 2016-09-10.
[12] Armando Vieira. 2015. Predicting online user behaviour using deep learning algorithms. The Computing Research Repository (CoRR) abs/1511.06247 (2015).
[13] Zhenzhou Wu, Bao Hong Tan, Rubing Duan, Yong Liu, and Rick Siow Mong Goh. 2015. Neural Modeling of Buying Behaviour for E-Commerce from Clicking Patterns. In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge). ACM, New York, NY, USA, Article 12, 4 pages.
[14] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A Brief Survey on Sequence Classification. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 40–48.
[15] Yin Xu and Jin-Song Huang. 2015. Factors Influencing Cart Abandonment in the Online Shopping Process. Social Behavior and Personality: an international journal 43, 10 (Nov. 2015).
[16] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. The Computing Research Repository (CoRR) abs/1409.2329 (2014). http://arxiv.org/abs/1409.2329

Page 2: Predicting Shopping Behavior with Mixture of RNNsceur-ws.org/Vol-2311/paper_6.pdf · tion of shoppers’ behaviors, leveraging features from clickstream data generated during live

SIGIR 2017 eCom August 2017 Tokyo Japan Arthur Toth Louis Tan Giuseppe Di Fabbrizio and Ankur Daa

Table 1 Clickstream data distribution

Clickstream outcome Count Average MedianLength mi Length

ABANDON 29371 147 82 5BROWSING-ONLY 128450 646 166 6PURCHASE 41115 207 84 5

Total 198936 100

3 CLICKSTREAM DATAWe consider clickstream data collected over two weeks and con-sisting of n0 = 1 560 830 sessions A session Si i = 1 2 n0 is achronological sequence of recorded page views or ldquoclicksrdquo Let Vi jbe the jth page view of session i so that Si = (Vi1Vi2 Vimi )and mi is dened as the length of session i We exclude sessioni from our experiments if mi lt 4 after which n = 198 936 ses-sions remain Sessions with purchases are truncated before the rstpurchase conrmation Table 1 summarizes the n sessions

Our experiments use both the page type and dwell time of Vi j The page type of Vi j denoted as Pi j belongs to one of eight cate-gories including search pages product view pages login pages etcThe dwell time Di j of Vi j is the amount of time the user spendsviewing the page and is not available until the (j + 1)th page viewAfter the jth page view the clickstream data gathered for session i isgiven by Si |j = ((Pi1Di0) (Pi2Di1) (Pi j Di jminus1)) whereDi0 is undened ie Di0 = empty To reduce sparsity dwell timeswere placed in 8 bins evenly spaced by percentiles

4 MODELING APPROACHESOur goal is to classify customer behavior into nal decision cat-egories In particular clickstream sequences receive one of thefollowing labels PURCHASE if the sequence leads to an item pur-chase ABANDON if an item was left in the shopping cart butthere was no purchase and BROWSING-ONLY when the shoppingcart was not used The nal two categories can be combined toinvestigate PURCHASE vs NON_PURCHASE behavior In prelimi-nary studies the ABANDON sequences were much more similarto the PURCHASE sequences than to the BROWSING-ONLY se-quences so having three categories helped account for some ofthe confusability of the data Our eventual goal of applying ourmodels in a live system adds a constraint The classier must workfor incomplete sequences without using data from the ldquofuturerdquo

41 Mixture of High-Order Markov ChainsBertsimas et al [4] modeled a similar problem with a mixture ofMarkov chains using maximum a posteriori estimation to predictsequence labels In this approach the training data is partitioned byclass and separate Markov chain models are trained for each oneThe resultant models can estimate class likelihoods from sequencesLetCi be a random variable representing the outcome of session Si where Ci isin Ω and Ω is the set of possible clickstream outcomesThe models would be used to estimate the likelihoods P (Si |Ci = ω)for all ω isin Ω Using Bayesrsquo theorem and the law of total probabilitythe class posteriors for each of the three classes can be estimated

by the equation

P (Ci = ω |Si ) =P (Si |Ci = ω)P (Ci = ω)sum

ω isinΩ P (Si |Ci = ω)P (Ci = ω)

with the prior P (Ci = ω) estimated from countsThis model ts the problemrsquos constraints because likelihoods

can be produced from subsequences without using ldquofuturerdquo clicksAlthough each chain is trained only on click data separating databy class implicitly conditions them on class

Taking inspiration from the Automatic Speech Recognition (ASR)community and similarities to ldquoLanguage Modelingrdquo we adaptedsome of their more recent techniques to our problem In preliminaryexperiments 5-grams performed better than shorter chains so weused them Longer chains cause greater sparsity so we addressedthis with Kneser-Ney smoothing which performed best in a studyof language modeling techniques [6] We used the MITLM toolkitto train the Markov chains [7]

42 Recurrent Neural NetworksTaking further inspiration from the ASR community we replacedthe Markov-chains in our mixtures with RNNs Earlier languagemodeling work used feed-forward articial neural networks [3]but RNNs have performed better recently both in direct likelihoodmetrics and in overall ASR task metrics [9 16] Click-stream datadiers from ASR text and our mixture model diered from thetypical ASR approach so it was unclear whether RNNs would helpin our scenario

Figure 1 Simple RNN to Predict Following Token

Earlier RNN-based language models used the ldquosimple recurrentneural networkrdquo architecture [9] The underlying idea depicted inFigure 1 is that an input sequence represented by X0 Xtminus1is connected to a recurrent layer represented by A0 At whichis also connected to itself at the next point in time The recurrentlayer is also connected to an output layer In our models each RNNtries to predict the next input Xn+1 after each input Xn

ldquoSimplerdquo RNN-based language models for ASR were outper-formed by RNNs using the Long Short-Term Memory (LSTM) con-guration and ldquodrop-outrdquo [16] LSTM addressed the ldquovanishinggradientrdquo and ldquoexploding gradientrdquo problems ldquoDrop-outrdquo addressedovertting by probabilistically ignoring the contributions of non-recurrent nodes during training

In LSTM RNNs, some nodes function as "memory cells". Some connections retrieve information from them, and others cause them to forget. The LSTM equations are [16]:

\[
\mathrm{LSTM} : h_t^{l-1},\, h_{t-1}^{l},\, c_{t-1}^{l} \rightarrow h_t^{l},\, c_t^{l}
\]
\[
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
T_{2n,4n}
\begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}
\]
\[
c_t^{l} = f \odot c_{t-1}^{l} + i \odot g
\]
\[
h_t^{l} = o \odot \tanh(c_t^{l})
\]

where h and c represent hidden states and memory cells, subscripts refer to time, superscripts refer to layers, T_{n,m} is an affine transform from R^n to R^m, ⊙ denotes element-wise multiplication, and sigm and tanh are applied to each element.
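The LSTM update above can be written out directly: a single affine map produces the stacked gate pre-activations, which are split into input, forget, and output gates plus the candidate update. This is a plain-Python sketch of the equations from [16], with illustrative names, not the authors' TensorFlow code.

```python
import math

def sigm(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(h_below, h_prev, c_prev, T):
    """One LSTM step: (i, f, o, g) = (sigm, sigm, sigm, tanh) applied to
    T [h^{l-1}_t ; h^l_{t-1}], then c^l_t = f*c^l_{t-1} + i*g and
    h^l_t = o * tanh(c^l_t). `T` is the affine map R^{2n} -> R^{4n},
    given here as a (weights, bias) pair."""
    W, b = T
    x = h_below + h_prev                       # concatenation, length 2n
    z = [sum(w * xj for w, xj in zip(row, x)) + bk
         for row, bk in zip(W, b)]             # pre-activations, length 4n
    n = len(h_prev)
    i = [sigm(v) for v in z[0:n]]              # input gate
    f = [sigm(v) for v in z[n:2 * n]]          # forget gate
    o = [sigm(v) for v in z[2 * n:3 * n]]      # output gate
    g = [math.tanh(v) for v in z[3 * n:4 * n]] # candidate cell update
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```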

Since LSTM RNNs with drop-out worked much better than Markov chains for ASR, we replaced the Markov chains with them in our clickstream mixture models. Additional work was necessary, since our scenario did not exactly match language modeling. During training, input tokens were still used to predict following tokens. During testing, however, our goal was the sequence probabilities. These were calculated from token probabilities present in intermediate softmax layers in each LSTM model. Due to the network architecture, "future" events were not used for this. We used TensorFlow to implement our LSTM RNNs [1].
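The sequence-probability computation described above factorizes the sequence over the model's per-step softmax outputs. A minimal sketch, assuming each step's predicted distribution is available before the corresponding token is consumed (so no "future" clicks are used); the names are illustrative:

```python
import math

def sequence_log_prob(token_dists, tokens):
    """log P(x_1..x_T) = sum_t log P(x_t | x_<t).

    `token_dists[t]` is the softmax distribution the RNN emits *before*
    observing `tokens[t]`, i.e. conditioned only on the prefix.
    """
    return sum(math.log(dist[tok])
               for dist, tok in zip(token_dists, tokens))
```

In the mixture, each class's LSTM scores the sequence this way, and the resulting log-likelihoods are combined with the class priors exactly as with the Markov chains.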

5 EXPERIMENTS

We experimented on the dataset described in Section 3 and summarized in Table 1. It was partitioned into an 80% training / 20% testing split, using stratified sampling to retain class proportions.

All RNNs were trained for 10 epochs using batches of 20 sequences. We tested 16 combinations of the remaining parameters: the number of recurrent layers was 1 or 2, the keep probability was 1 or 0.5, and the hidden state size was 10, 20, 40, or 80. For a particular mixture model, all the RNNs used the same parameter values.
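The 2 × 2 × 4 grid above can be enumerated in a few lines. This sketch only reproduces the parameter combinations (with batch size 20 and 10 epochs held fixed); it is not the training loop, and the key names are ours:

```python
from itertools import product

# Every tested combination: layer count x keep probability x hidden size.
grid = [{"layers": l, "keep_prob": k, "hidden": h}
        for l, k, h in product([1, 2], [1.0, 0.5], [10, 20, 40, 80])]
assert len(grid) == 16
```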

5.1 Results

For each model, we evaluated the prediction performance by truncating the page view sequences at different lengths. Table 2 shows the results for the mixture of Markov models and for one of the mixture-of-RNN trials. Although we tested 16 different RNN parameter combinations, the results were so similar that we are only reporting one of them.

Table 2 reports precision, recall, and F1-measure for each specific sequence outcome when considering 25%, 50%, 75%, and 100% of the total length of the sequence. For instance, when splitting at 50%, the Markov chain model can predict a PURCHASE with a 0.42 precision and 0.11 recall, resulting in an overall F1-measure of 0.17. For the same conditions, the RNN-based model reaches a precision of 0.82 with 0.71 recall and an F1-measure of 0.76. We also report the accuracy when randomly selecting a class based on the prior distribution of the clickstream corpus. RNN mixture components substantially outperform Markov chain components. This is particularly evident from Figure 2, which shows the F1-measure by sequence length for the mixtures of MCMs (dotted lines) and of RNNs (solid lines). Both models monotonically increase performance as the model observes more data, from 25% splits to 100%, but the mixture of RNNs has an immediate edge even with the short 25% sequences. The MCMs present similar F1-measures for the majority class (i.e., BROWSING-ONLY), but it is penalized

Figure 2: F1-measure by sequence length (25%, 50%, 75%, 100%) for mixtures of Markov Chain Models (MCM) and RNNs for each label. ABN = abandoned, BRO = browsing-only, PRC = purchase.

by the lack of data for the less represented sequences (i.e., 14.7% for ABANDON and 20.7% for PURCHASE). RNNs instead generalize better, due to the memory component that can model long-distance dependencies.

Figure 3: Log-probability trajectories from the MCM mixture progressing along page view sequences for each class (BROWSING-ONLY, ABANDON, PURCHASE).

Similarly to Bertsimas et al. [4], in Figure 3 we plot 100 log-probability trajectories, with lengths from 2 through 20 page views, estimated by the MCM mixture along page view sequences for each class. This plot demonstrates how probabilities evolve during interactions and how confident each model is compared to the others.

The legend in Figure 3 corresponds to the true final state labels of the test examples: dashed red lines for ABANDON sequences, dashed-and-dotted green lines for BROWSING-ONLY sequences, and solid blue lines for PURCHASE sequences. Ideally, the model


Table 2: Precision (P), Recall (R), and F1-measure (F1) for MCM and RNN mixtures by sequence length.

Sequence Length       25%              50%              75%              100%
Final State       P    R    F1    P    R    F1    P    R    F1    P    R    F1    Random

MCM
ABANDON         0.44 0.02 0.04  0.45 0.09 0.14  0.72 0.52 0.61  0.70 0.64 0.67    0.21
BROWSING-ONLY   0.66 0.99 0.79  0.69 0.98 0.81  0.89 0.97 0.92  0.99 0.95 0.97    0.64
PURCHASE        0.37 0.02 0.04  0.42 0.11 0.17  0.72 0.67 0.69  0.71 0.86 0.78    0.15

RNN
ABANDON         0.75 0.39 0.52  0.73 0.62 0.67  0.74 0.72 0.73  1.00 0.84 0.91    0.21
BROWSING-ONLY   0.84 0.99 0.91  0.92 0.99 0.96  0.97 0.99 0.98  1.00 0.98 0.99    0.64
PURCHASE        0.80 0.64 0.71  0.82 0.71 0.76  0.82 0.78 0.80  0.86 1.00 0.92    0.15

would score the PURCHASE sequences (solid blue lines) high and the other sequences low, and the earlier the distinction could be made, the better. Looking at this figure, there does appear to be some level of discrimination between the categories. In general, the BROWSING-ONLY sequences seem more separable from the PURCHASE sequences than the ABANDON sequences, as expected.

Although Table 2 can be used for comparison with other work [10], it depends on sequence lengths. A useful live system must predict final actions before sequences are complete and needs a decision process for accepting the predicted label. We experimented with taking the highest-scoring label once the entropy fell below a threshold. Figure 4 shows the F1-measures at different thresholds, which are expressed as proportions of the maximum possible entropy. Since the entropy might not drop below the threshold, it is important to consider how many sequences have predicted labels. When we calculated F1-measures, sequences without predictions were counted as misses. Figure 5 shows the number of sequences where predictions were made, by entropy threshold. Again, the horizontal axis represents the proportion of the maximum possible entropy value, and the vertical axis represents the number of sequences where a decision can be made based on threshold crossing. As an example, a threshold of 0.55 led to reasonable F1-measures while producing predictions for 99% of the sequences before they were complete. Choosing higher entropy thresholds allows decisions to be made for more sequences, but performance can suffer, since decisions can be made while class probabilities are more uniform and confidence is lower. Choosing lower entropy thresholds forces the class probabilities to be more distinct, which leads to more confident decisions, but performance starts to suffer when fewer sequences receive decisions. In practice, the threshold would be tuned on held-out data.
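The decision rule above can be sketched directly: compute the entropy of the current class posterior, compare it against a fraction of the maximum possible entropy (the log of the class count), and commit to the highest-scoring label only once the entropy drops below that fraction. This is an illustrative sketch of the rule, not the authors' exact code.

```python
import math

def decide(posterior, threshold_frac):
    """Return the top label once posterior entropy falls below
    `threshold_frac` * max entropy; return None to keep watching.
    `posterior` maps each class label to its current probability."""
    probs = [p for p in posterior.values() if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(posterior))
    if entropy < threshold_frac * max_entropy:
        return max(posterior, key=posterior.get)
    return None
```

With the 0.55 threshold discussed above, a sharply peaked posterior triggers a decision, while a near-uniform one defers it.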

6 CONCLUSION AND FUTURE WORK

We presented two models for the real-time early prediction of shopping session outcomes on an e-commerce platform. We demonstrated that LSTM RNNs generalize better, and with less data, than the high-order Markov chain models used in previous work. Our approach, in addition to distinguishing browsing-only and cart-interaction sessions, can also accurately discriminate between cart abandonment and purchase sessions. Future work will focus on features, single-RNN architectures, and decision strategies.

Figure 4: F1-measures at entropy thresholds for each class (Abandon, No_Cart_Interaction, Purchase). Horizontal axis: entropy threshold as a proportion of maximum entropy; vertical axis: F1-measure.

Figure 5: Number of sequences where entropy goes below the threshold, by entropy threshold (proportion of maximum entropy).

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org. Software available from tensorflow.org.

[2] Aditya Awalkar, Ibrahim Ahmed, and Tejas Nevrekar. 2016. Prediction of User's Purchases using Clickstream Data. International Journal of Engineering Science and Computing (2016).

[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (March 2003), 1137–1155.

[4] Dimitris Bertsimas, Adam J. Mersereau, and Nitin R. Patel. 2003. Dynamic Classification of Online Customers. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003. 107–118.

[5] Robert Florentin. 2016. Online shopping abandonment rate: a new perspective: the role of choice conflicts as a factor of online shopping abandonment. Master's thesis. University of Twente.

[6] Joshua T. Goodman. 2001. A Bit of Progress in Language Modeling. Comput. Speech Lang. 15, 4 (Oct. 2001), 403–434.

[7] Bo-June Paul Hsu and James R. Glass. 2008. Iterative language model estimation: efficient data structure & algorithms. In INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008. 841–844.

[8] Monika Kukar-Kinney and Angeline G. Close. 2010. The determinants of consumers' online shopping cart abandonment. Journal of the Academy of Marketing Science 38, 2 (2010), 240–250.

[9] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In The 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010).

[10] Deepak Pai, Abhijit Sharang, Meghanath Macha Yadagiri, and Shradha Agrawal. 2014. Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach. Springer International Publishing, 135–145.

[11] The Statistics Portal. 2014. Digital buyer penetration worldwide from 2014 to 2019. http://www.statista.com/statistics/261676/digital-buyer-penetration-worldwide (2014). Accessed 2016-09-10.

[12] Armando Vieira. 2015. Predicting online user behaviour using deep learning algorithms. The Computing Research Repository (CoRR) abs/1511.06247 (2015).

[13] Zhenzhou Wu, Bao Hong Tan, Rubing Duan, Yong Liu, and Rick Siow Mong Goh. 2015. Neural Modeling of Buying Behaviour for E-Commerce from Clicking Patterns. In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge). ACM, New York, NY, USA, Article 12, 4 pages.

[14] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A Brief Survey on Sequence Classification. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 40–48.

[15] Yin Xu and Jin-Song Huang. 2015. Factors Influencing Cart Abandonment in the Online Shopping Process. Social Behavior and Personality: an international journal 43, 10 (Nov. 2015).

[16] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. The Computing Research Repository (CoRR) abs/1409.2329 (2014). http://arxiv.org/abs/1409.2329

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 Clickstream Data
  • 4 Modeling Approaches
    • 41 Mixture of High-Order Markov Chains
    • 42 Recurrent Neural Networks
      • 5 Experiments
        • 51 Results
          • 6 Conclusion and Future Work
          • References
Page 3: Predicting Shopping Behavior with Mixture of RNNsceur-ws.org/Vol-2311/paper_6.pdf · tion of shoppers’ behaviors, leveraging features from clickstream data generated during live

Predicting Shopping Behavior with Mixture of RNNs SIGIR 2017 eCom August 2017 Tokyo Japan

ifoд

+-

=

siдmsiдmsiдmsiдm

+-

T2n4n

hlminus1t

hltminus1

+-

c lt=f cltminus1+iд

hlt=otanh (clt )

where h and c represent hidden states and memory cells subscriptsrefer to time superscripts refer to layersTnm is an ane transformfrom Rn to Rm refers to element-wise multiplication and siдmand tanh are applied to each element

Since LSTM RNNs with drop-out worked much better thanMarkov chains for ASR we replaced the Markov chains with themin our clickstream mixture models Additional work was neces-sary since our scenario did not exactly match language modelingDuring training input tokens were still used to predict followingtokens During testing however our goal was the sequence prob-abilities These were calculated from token probabilities presentin intermediate softmax layers in each LSTM model Due to thenetwork architecture ldquofuturerdquo events were not used for this Weused TensorFlow to implement our LSTM RNNs [1]

5 EXPERIMENTSWe experimented on the dataset described in Section 3 and summa-rized in Table 1 It was partitioned into an 80 training20 testingsplit using stratied sampling to retain class proportions

All RNNs were trained for 10 epochs using batches of 20 se-quences We tested 16 combinations of the remaining parametersThe number of recurrent layers was 1 or 2 the keep probabilitywas 1 or 05 and the hidden state size was 10 20 40 or 80 For aparticular mixture model all the RNNs used the same parametervalues

51 ResultsFor each model we evaluated the prediction performance by trun-cating the page view sequences at dierent lengths Table 2 showsthe results for the mixture of Markov models and for one of themixture of RNN trials Although we tested 16 dierent RNN parame-ter combinations results were so similar that we are only reportingon one of them

Table 2 reports precision recall and F1-measure for each specicsequence outcome when considering 25 50 75 and 100 ofthe total length of the sequence For instance when splitting at50 the Markov chain model can predict a PURCHASE with a042 precision and 011 recall resulting in an overall F1-measureof 017 For the same conditions the RNN-based model reaches aprecision of 082 with 071 recall and an F1-measure of 076 Wealso report the accuracy when randomly selecting a class basedon the prior distribution of the clickstream corpus RNN mixturecomponents substantially outperform Markov chain componentsThis is particularly evident from Figure 2 which shows the F1-measure by sequence length for the mixtures of MCMs (dottedline) and of RNNs (solid line) Both models monotonically increaseperformance as the model observes more data from 25 splits to100 but the mixture of RNNs has an immediate edge even withthe short 25 sequences The MCMs present similar F1-measuresfor the majority class (ie BROWSING-ONLY) but it is penalized

0

01

02

03

04

05

06

07

08

09

1

25 50 75 100

F1-measurebysequencelength

MCM-ABN MCM-BRO MCM-PRC

RNN-ABN RNN-BRO RNN-PRC

Figure 2 F1-measure by sequence length for mixtures of MarkovChain Models (MCM) and RNNs for each label ABN=abandonedBRO=browsing-only PRC=purchase

by the lack of data for the less represented sequences (ie 147 forABANDON and 207 for PURCHASE) RNNs instead generalizebetter due to the memory component that can model long-distancedependencies

BROWSING-ONLYABANDON

PURCHASE

Figure 3 Log-probability trajectories from the MCMmixture pro-gressing along page view sequences for each class

Similarly to Bertsimas et al [4] in Figure 3 we plot 100 log-probability trajectories with lengths from 2 through 20 page viewsestimated by the MCM mixture along page view sequences foreach class This plot demonstrates how probabilities evolve duringinteractions and how condent each model is compared to others

The legend in Figure 3 corresponds to the true nal state labelsof the test examples dashed red lines for ABANDON sequencesdashed and dotted green lines for BROWSING-ONLY sequencesand solid blue lines for PURCHASE sequences Ideally the model

SIGIR 2017 eCom August 2017 Tokyo Japan Arthur Toth Louis Tan Giuseppe Di Fabbrizio and Ankur Daa

Table 2 Precision (P) Recall (R) and F1-measure (F) for MCM and RNN mixtures by sequence length

Mixture Type Sequence Length 25 50 75 100

MCM Final State P R F1 P R F1 P R F1 P R F1 Random

ABANDON 044 002 004 045 009 014 072 052 061 070 064 067 021BROWSING-ONLY 066 099 079 069 098 081 089 097 092 099 095 097 064PURCHASE 037 002 004 042 011 017 072 067 069 071 086 078 015

RNN Final State P R F1 P R F1 P R F1 P R F1 Random

ABANDON 075 039 052 073 062 067 074 072 073 100 084 091 021BROWSING-ONLY 084 099 091 092 099 096 097 099 098 100 098 099 064PURCHASE 080 064 071 082 071 076 082 078 080 086 100 092 015

would score the PURCHASE sequences (solid blue lines) high andthe other sequences low and the earlier the distinction could bemade the better Looking at this gure there does appear to besome level of discrimination between the categories In generalthe BROWSING-ONLY sequences seem more separable from thePURCHASE sequences than the ABANDON sequences as expected

Although Table 2 can be used to compare other work [10] it de-pends on sequence lengths A useful live system must predict nalactions before sequences are complete and needs a decision processfor accepting the predicted label We experimented with takingthe highest scoring label once the entropy fell below a thresholdFigure 4 shows the F1-measures at dierent thresholds which areproportions of the maximum possible entropy Since the entropymight not drop below the threshold it is important to considerhow many sequences have predicted labels When we calculated F1-measures sequences without predictions were counted as missesFigure 5 shows the number of sequences where predictions weremade based on entropy threshold Again the horizontal axis rep-resents the proportion of the maximum possible entropy valueThe vertical axis represents the number of sequences where a de-cision can be made based on threshold crossing As an example athreshold of 055 led to reasonable F1-measures while producingpredictions for 99 of the sequences before they were completeChoosing higher entropy thresholds allows decisions to be madefor more sequences but performance can suer since decisions canbe made while class probabilities are more uniform and condenceis lower Choosing lower entropy thresholds forces the class proba-bilities to be more distinct which leads to more condent decisionsbut performance starts to suer when fewer sequences receivedecisions In practice the threshold would be tuned on held-outdata

6 CONCLUSION AND FUTUREWORKWe presented two models for the real-time early prediction ofshopping session outcomes on an e-commerce platform We demon-strated that LSTM RNNs generalize better and with less data thanhigh-order Markov chain models used in previous work Our ap-proach in addition to distinguishing browsing-only and cart-interactionsessions can also accurately discriminate between cart abandon-ment and purchase sessions Future work will focus on featuressingle RNN architectures and decision strategies

0

02

04

06

08

1

099 088 077 066 055 044 033 022 011

F1-m

easure

EntropyThreshold (Proportion ofMaximumEntropy)

F1-measurebyLabelDistribution Entropy

Abandon No_Cart_Interaction Purchase

Figure 4 F1-measures at Entropy Thresholds for each class

05000

1000015000200002500030000350004000045000

099 088 077 066 055 044 033 022 011

NumberofSequences

EntropyThreshold (Proportion ofMaximumEntropy)

SequenceswhereEntropyGoesBelowThreshold

Figure 5 Sequences Crossing Entropy Thresholds

REFERENCES[1] Martiacuten Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen

Craig Citro Greg S Corrado Andy Davis Jerey Dean Matthieu Devin San-jay Ghemawat Ian Goodfellow Andrew Harp Georey Irving Michael IsardYangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Leven-berg Dan Maneacute Rajat Monga Sherry Moore Derek Murray Chris Olah MikeSchuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar PaulTucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vieacutegas Oriol VinyalsPete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng2015 TensorFlow Large-Scale Machine Learning on Heterogeneous Systems(2015) httptensoroworg Software available from tensoroworg

Predicting Shopping Behavior with Mixture of RNNs SIGIR 2017 eCom August 2017 Tokyo Japan

[2] Aditya Awalkar Ibrahim Ahmed and Tejas Nevrekar 2016 Prediction of UserrsquosPurchases using Clickstream Data International Journal of Engineering Scienceand Computing (2016)

[3] Yoshua Bengio Reacutejean Ducharme Pascal Vincent and Christian Janvin 2003A Neural Probabilistic Language Model J Mach Learn Res 3 (March 2003)1137ndash1155

[4] Dimitris Bertsimas Adam J Mersereau and Nitin R Patel 2003 DynamicClassication of Online Customers In Proceedings of the Third SIAM InternationalConference on Data Mining San Francisco CA USA May 1-3 2003 107ndash118

[5] Robert Florentin 2016 Online shopping abandonment rate a new perspective therole of choice conicts as a factor of online shopping abandonment Masterrsquos thesisUniversity of Twente

[6] Joshua T Goodman 2001 A Bit of Progress in Language Modeling ComputSpeech Lang 15 4 (Oct 2001) 403ndash434

[7] Bo-June Paul Hsu and James R Glass 2008 Iterative language model estimationecient data structure amp algorithms In INTERSPEECH 2008 9th Annual Confer-ence of the International Speech Communication Association Brisbane AustraliaSeptember 22-26 2008 841ndash844

[8] Monika Kukar-Kinney and Angeline G Close 2010 The determinants of con-sumersrsquo online shopping cart abandonment Journal of the Academy of MarketingScience 38 2 (2010) 240ndash250

[9] Tomaacuteš Mikolov Martin Karaaacutet Lukaacuteš Burget Jan Černockyacute and Sanjeev Khu-danpur 2010 Recurrent neural network based language model In The 11thAnnual Conference of the International Speech Communication Association (IN-TERSPEECH 2010)

[10] Deepak Pai Abhijit Sharang Meghanath Macha Yadagiri and Shradha Agrawal2014 Modelling Visit Similarity Using Click-Stream Data A Supervised ApproachSpringer International Publishing 135ndash145

[11] The Statistics Portal 2014 Digital buyer penetration world-wide from 2014 to 2019 httpwwwstatistacomstatistics261676digital-buyer-penetration-worldwide (2014) Accessed 2016-09-10

[12] Armando Vieira 2015 Predicting online user behaviour using deep learningalgorithms The Computing Research Repository (CoRR) abs151106247 (2015)

[13] Zhenzhou Wu Bao Hong Tan Rubing Duan Yong Liu and Rick Siow Mong Goh2015 Neural Modeling of Buying Behaviour for E-Commerce from ClickingPatterns In Proceedings of the 2015 International ACM Recommender SystemsChallenge (RecSys rsquo15 Challenge) ACM New York NY USA Article 12 4 pages

[14] Zhengzheng Xing Jian Pei and Eamonn Keogh 2010 A Brief Survey on SequenceClassication SIGKDD Explor Newsl 12 1 (Nov 2010) 40ndash48

[15] Yin Xu and Jin-Song Huang 2015 Factors Inuencing Cart Abandonment inthe Online Shopping Process Social Behavior and Personality an internationaljournal 43 10 (Nov 2015)

[16] Wojciech Zaremba Ilya Sutskever and Oriol Vinyals 2014 Recurrent Neural Net-work Regularization The Computing Research Repository (CoRR) abs14092329(2014) httparxivorgabs14092329

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 Clickstream Data
  • 4 Modeling Approaches
    • 41 Mixture of High-Order Markov Chains
    • 42 Recurrent Neural Networks
      • 5 Experiments
        • 51 Results
          • 6 Conclusion and Future Work
          • References
Page 4: Predicting Shopping Behavior with Mixture of RNNsceur-ws.org/Vol-2311/paper_6.pdf · tion of shoppers’ behaviors, leveraging features from clickstream data generated during live

SIGIR 2017 eCom August 2017 Tokyo Japan Arthur Toth Louis Tan Giuseppe Di Fabbrizio and Ankur Daa

Table 2 Precision (P) Recall (R) and F1-measure (F) for MCM and RNN mixtures by sequence length

Mixture Type Sequence Length 25 50 75 100

MCM Final State P R F1 P R F1 P R F1 P R F1 Random

ABANDON 044 002 004 045 009 014 072 052 061 070 064 067 021BROWSING-ONLY 066 099 079 069 098 081 089 097 092 099 095 097 064PURCHASE 037 002 004 042 011 017 072 067 069 071 086 078 015

RNN Final State P R F1 P R F1 P R F1 P R F1 Random

ABANDON 075 039 052 073 062 067 074 072 073 100 084 091 021BROWSING-ONLY 084 099 091 092 099 096 097 099 098 100 098 099 064PURCHASE 080 064 071 082 071 076 082 078 080 086 100 092 015

would score the PURCHASE sequences (solid blue lines) high andthe other sequences low and the earlier the distinction could bemade the better Looking at this gure there does appear to besome level of discrimination between the categories In generalthe BROWSING-ONLY sequences seem more separable from thePURCHASE sequences than the ABANDON sequences as expected

Although Table 2 can be used to compare other work [10] it de-pends on sequence lengths A useful live system must predict nalactions before sequences are complete and needs a decision processfor accepting the predicted label We experimented with takingthe highest scoring label once the entropy fell below a thresholdFigure 4 shows the F1-measures at dierent thresholds which areproportions of the maximum possible entropy Since the entropymight not drop below the threshold it is important to considerhow many sequences have predicted labels When we calculated F1-measures sequences without predictions were counted as missesFigure 5 shows the number of sequences where predictions weremade based on entropy threshold Again the horizontal axis rep-resents the proportion of the maximum possible entropy valueThe vertical axis represents the number of sequences where a de-cision can be made based on threshold crossing As an example athreshold of 055 led to reasonable F1-measures while producingpredictions for 99 of the sequences before they were completeChoosing higher entropy thresholds allows decisions to be madefor more sequences but performance can suer since decisions canbe made while class probabilities are more uniform and condenceis lower Choosing lower entropy thresholds forces the class proba-bilities to be more distinct which leads to more condent decisionsbut performance starts to suer when fewer sequences receivedecisions In practice the threshold would be tuned on held-outdata

6 CONCLUSION AND FUTUREWORKWe presented two models for the real-time early prediction ofshopping session outcomes on an e-commerce platform We demon-strated that LSTM RNNs generalize better and with less data thanhigh-order Markov chain models used in previous work Our ap-proach in addition to distinguishing browsing-only and cart-interactionsessions can also accurately discriminate between cart abandon-ment and purchase sessions Future work will focus on featuressingle RNN architectures and decision strategies

0

02

04

06

08

1

099 088 077 066 055 044 033 022 011

F1-m

easure

EntropyThreshold (Proportion ofMaximumEntropy)

F1-measurebyLabelDistribution Entropy

Abandon No_Cart_Interaction Purchase

Figure 4 F1-measures at Entropy Thresholds for each class

05000

1000015000200002500030000350004000045000

099 088 077 066 055 044 033 022 011

NumberofSequences

EntropyThreshold (Proportion ofMaximumEntropy)

SequenceswhereEntropyGoesBelowThreshold

Figure 5 Sequences Crossing Entropy Thresholds

REFERENCES[1] Martiacuten Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen

Craig Citro Greg S Corrado Andy Davis Jerey Dean Matthieu Devin San-jay Ghemawat Ian Goodfellow Andrew Harp Georey Irving Michael IsardYangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Leven-berg Dan Maneacute Rajat Monga Sherry Moore Derek Murray Chris Olah MikeSchuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar PaulTucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vieacutegas Oriol VinyalsPete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng2015 TensorFlow Large-Scale Machine Learning on Heterogeneous Systems(2015) httptensoroworg Software available from tensoroworg

Predicting Shopping Behavior with Mixture of RNNs SIGIR 2017 eCom August 2017 Tokyo Japan

[2] Aditya Awalkar Ibrahim Ahmed and Tejas Nevrekar 2016 Prediction of UserrsquosPurchases using Clickstream Data International Journal of Engineering Scienceand Computing (2016)

[3] Yoshua Bengio Reacutejean Ducharme Pascal Vincent and Christian Janvin 2003A Neural Probabilistic Language Model J Mach Learn Res 3 (March 2003)1137ndash1155

[4] Dimitris Bertsimas Adam J Mersereau and Nitin R Patel 2003 DynamicClassication of Online Customers In Proceedings of the Third SIAM InternationalConference on Data Mining San Francisco CA USA May 1-3 2003 107ndash118

[5] Robert Florentin 2016 Online shopping abandonment rate a new perspective therole of choice conicts as a factor of online shopping abandonment Masterrsquos thesisUniversity of Twente

[6] Joshua T Goodman 2001 A Bit of Progress in Language Modeling ComputSpeech Lang 15 4 (Oct 2001) 403ndash434

[7] Bo-June Paul Hsu and James R Glass 2008 Iterative language model estimationecient data structure amp algorithms In INTERSPEECH 2008 9th Annual Confer-ence of the International Speech Communication Association Brisbane AustraliaSeptember 22-26 2008 841ndash844

[8] Monika Kukar-Kinney and Angeline G Close 2010 The determinants of con-sumersrsquo online shopping cart abandonment Journal of the Academy of MarketingScience 38 2 (2010) 240ndash250

[9] Tomaacuteš Mikolov Martin Karaaacutet Lukaacuteš Burget Jan Černockyacute and Sanjeev Khu-danpur 2010 Recurrent neural network based language model In The 11thAnnual Conference of the International Speech Communication Association (IN-TERSPEECH 2010)

[10] Deepak Pai Abhijit Sharang Meghanath Macha Yadagiri and Shradha Agrawal2014 Modelling Visit Similarity Using Click-Stream Data A Supervised ApproachSpringer International Publishing 135ndash145

[11] The Statistics Portal 2014 Digital buyer penetration world-wide from 2014 to 2019 httpwwwstatistacomstatistics261676digital-buyer-penetration-worldwide (2014) Accessed 2016-09-10

[12] Armando Vieira 2015 Predicting online user behaviour using deep learningalgorithms The Computing Research Repository (CoRR) abs151106247 (2015)

[13] Zhenzhou Wu Bao Hong Tan Rubing Duan Yong Liu and Rick Siow Mong Goh2015 Neural Modeling of Buying Behaviour for E-Commerce from ClickingPatterns In Proceedings of the 2015 International ACM Recommender SystemsChallenge (RecSys rsquo15 Challenge) ACM New York NY USA Article 12 4 pages

[14] Zhengzheng Xing Jian Pei and Eamonn Keogh 2010 A Brief Survey on SequenceClassication SIGKDD Explor Newsl 12 1 (Nov 2010) 40ndash48

[15] Yin Xu and Jin-Song Huang 2015 Factors Inuencing Cart Abandonment inthe Online Shopping Process Social Behavior and Personality an internationaljournal 43 10 (Nov 2015)

[16] Wojciech Zaremba Ilya Sutskever and Oriol Vinyals 2014 Recurrent Neural Net-work Regularization The Computing Research Repository (CoRR) abs14092329(2014) httparxivorgabs14092329

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 Clickstream Data
  • 4 Modeling Approaches
    • 41 Mixture of High-Order Markov Chains
    • 42 Recurrent Neural Networks
      • 5 Experiments
        • 51 Results
          • 6 Conclusion and Future Work
          • References
Page 5: Predicting Shopping Behavior with Mixture of RNNsceur-ws.org/Vol-2311/paper_6.pdf · tion of shoppers’ behaviors, leveraging features from clickstream data generated during live

Predicting Shopping Behavior with Mixture of RNNs SIGIR 2017 eCom August 2017 Tokyo Japan

[2] Aditya Awalkar, Ibrahim Ahmed, and Tejas Nevrekar. 2016. Prediction of User's Purchases using Clickstream Data. International Journal of Engineering Science and Computing (2016).

[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (March 2003), 1137–1155.

[4] Dimitris Bertsimas, Adam J. Mersereau, and Nitin R. Patel. 2003. Dynamic Classification of Online Customers. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1–3, 2003. 107–118.

[5] Robert Florentin. 2016. Online shopping abandonment rate, a new perspective: the role of choice conflicts as a factor of online shopping abandonment. Master's thesis. University of Twente.

[6] Joshua T. Goodman. 2001. A Bit of Progress in Language Modeling. Comput. Speech Lang. 15, 4 (Oct. 2001), 403–434.

[7] Bo-June Paul Hsu and James R. Glass. 2008. Iterative language model estimation: efficient data structure & algorithms. In INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22–26, 2008. 841–844.

[8] Monika Kukar-Kinney and Angeline G. Close. 2010. The determinants of consumers' online shopping cart abandonment. Journal of the Academy of Marketing Science 38, 2 (2010), 240–250.

[9] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In The 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010).

[10] Deepak Pai, Abhijit Sharang, Meghanath Macha Yadagiri, and Shradha Agrawal. 2014. Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach. Springer International Publishing, 135–145.

[11] The Statistics Portal. 2014. Digital buyer penetration worldwide from 2014 to 2019. http://www.statista.com/statistics/261676/digital-buyer-penetration-worldwide (2014). Accessed 2016-09-10.

[12] Armando Vieira. 2015. Predicting online user behaviour using deep learning algorithms. The Computing Research Repository (CoRR) abs/1511.06247 (2015).

[13] Zhenzhou Wu, Bao Hong Tan, Rubing Duan, Yong Liu, and Rick Siow Mong Goh. 2015. Neural Modeling of Buying Behaviour for E-Commerce from Clicking Patterns. In Proceedings of the 2015 International ACM Recommender Systems Challenge (RecSys '15 Challenge). ACM, New York, NY, USA, Article 12, 4 pages.

[14] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A Brief Survey on Sequence Classification. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 40–48.

[15] Yin Xu and Jin-Song Huang. 2015. Factors Influencing Cart Abandonment in the Online Shopping Process. Social Behavior and Personality: an international journal 43, 10 (Nov. 2015).

[16] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. The Computing Research Repository (CoRR) abs/1409.2329 (2014). http://arxiv.org/abs/1409.2329

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 Clickstream Data
  • 4 Modeling Approaches
    • 4.1 Mixture of High-Order Markov Chains
    • 4.2 Recurrent Neural Networks
  • 5 Experiments
    • 5.1 Results
  • 6 Conclusion and Future Work
  • References