Download - Deep Joint Embeddings of Context and Contentfor ... - arXivWord2Vec [12] inspired the product embeddings from purchase sequences in Prod2Vec [6], which was expanded to include product


Deep Joint Embeddings of Context and Content for Recommendation

Miklas S. Kristoffersen∗
Bang & Olufsen A/S, BBC R&D, Aalborg University

Jacob L. Wieland
BBC, London, United Kingdom

Sven E. Shepstone
Bang & Olufsen A/S, Struer, Denmark

Zheng-Hua Tan
Aalborg University, Aalborg, Denmark

Vinoba Vinayagamoorthy
BBC R&D, London, United Kingdom

ABSTRACT

This paper proposes a deep learning-based method for learning joint context-content embeddings (JCCE) with a view to context-aware recommendations, and demonstrates its application in the television domain. JCCE builds on recent progress within latent representations for recommendation and deep metric learning. The model effectively groups viewing situations and associated consumed content, based on supervision from 2.7 million viewing events. Experiments confirm the recommendation ability of JCCE, achieving improvements when compared to state-of-the-art methods. Furthermore, the approach shows meaningful structures in the learned representations that can be used to gain valuable insights into underlying factors in the relationship between contextual settings and content properties.

KEYWORDS

Context-Aware Recommender Systems, Deep Learning, Television.

1 INTRODUCTION

Recommender systems (RS) have evolved tremendously in recent years thanks to widespread attention in both academia and industry [1, 17, 25]. One of the major developments has been the introduction of deep learning-based methods, which have demonstrated superior performance in numerous applications [28]. These deep methods have proven to be able to effectively capture nonlinear and nontrivial relationships in the interactions between users and items. Beyond users and items, context-aware RS (CARS) have seen new exciting opportunities. The ability of deep neural networks to learn underlying explanatory factors and low-dimensional representations from sparse input data with a large number of attributes is key to the achievements of recent latent CARS, e.g. as demonstrated in an unsupervised setting in [22]. In combination with the rich descriptive information of user-item interactions that is available in several modern real-world applications, these methods provide a promising way forward in the endeavor to understand users and provide them with context-aware recommendations.

∗ The work was done in part while the author was visiting BBC R&D in London, UK. His work is supported by the Innovation Fund Denmark (IFD) under File No. 5189-00009B.

[Figure 1 diagram: a context input C (user ID(s), time of day, age, ...) and a content input I (program ID, genre, duration, ...) each pass through an encoder, producing the embeddings ϕC(C) and ϕI(I) in a learned latent space, where their similarity is computed.]

Figure 1: Framework for learning a shared latent space in order to compute similarities between context and content.

Television exemplifies one such application in which content delivery may benefit from intelligently adapting to viewing situations, and where insights into user behavior are of high interest to stakeholders, such as content creators, schedulers, and advertisers. Broadcasters’ Audience Research Board (BARB)¹ maintains a panel of households in the UK that represent the television viewing across the nation. Each household in the panel is equipped with measuring devices, often referred to as meters, associated with each television in the home. These meters offer a semi-automatic way for participants to report who has been watching what and when.² The data contain a large number of interactions compared to alternative data collection methods within CARS for television, such as self-reported consumption [11]. The large volume of reported viewing events allows data-driven learning of underlying structures, as has been explored in related literature on e.g. implicit preference elicitation [5] and group viewing patterns [2, 15]. It also motivates the deep learning-based methods studied in this work, where we are specifically interested in providing accurate recommendations based on contextual settings of viewing situations, while also learning representations that can help explore the complex patterns of television consumption.

To this end, we turn our attention to the framework in Fig. 1, which shows the principles behind the contributions of this paper. The core foundation is the embedding of context and content in a shared latent space. Here context is defined as the collection of variables specifying a viewing situation, such as user identities and temporal aspects, and content represents programs and associated metadata. Bringing both context and content into the same latent space allows for significant advantages, such as computing similarities across domains, and ideally enables us to learn to group contexts with relevant content while also grouping contexts that show similar preferences in terms of content, and content that tends to be consumed in similar contexts. Also, content embeddings are static at serving time and only have to be computed once (until the model is updated, e.g. with new content), which allows for efficient recommendations: a forward pass of the context encoder and similarity computations with the precomputed content embeddings.

¹ https://www.barb.co.uk.

² Presence in front of the television is registered manually using a remote control and supports multiple simultaneous users. All other variables are collected automatically.

arXiv:1909.06076v1 [cs.IR] 13 Sep 2019

We propose the joint context-content embeddings (JCCE) method, which learns a metric to encode contextual preferences of content. Compared to other methods, such as Wide & Deep [3], we keep context and content encoding distinct, thereby allowing efficient inference as well as independent investigations of each domain. By using the N-pairs mini-batch loss [18], we train both encoders simultaneously to optimize a joint objective. The approach should apply to generic RS, but we focus on its application within the television domain. A dataset consisting of 2.7M viewing events is used to demonstrate the recommender performance of JCCE. In addition to the results obtained for recommendation, JCCE enables qualitative insights into the relationships among contextual settings and consumed content properties. In Section 3.4, we highlight this capability by displaying visualizations of embeddings, and using those we show meaningful structures in the learned latent space.

Related Work on Latent RS and CARS. Related research in RS has investigated embeddings as a powerful tool for recommendation. Early work on learning user and item latent representations with neural networks was proposed in [14]. Recently, the success of Word2Vec [12] inspired the product embeddings from purchase sequences in Prod2Vec [6], which was expanded to include product side information in Meta-Prod2Vec [24]. [7] shows competitive results in a Kaggle competition using simple mappings of categorical variables to dense vector representations. They use concatenated entity embeddings and continuous variables as input for a neural network. In [21], item and event embeddings are introduced for session-aware recommendations. Collaborative metric learning is coined in [9] and demonstrates learning of a joint user-item metric. [4] presents a real-world implementation for video content, while Wide & Deep [3] proves efficient for recommending apps. Deep-Cross [16] uses a similar framework, but replaces the standard feed-forward network with a residual network. Neural Factorization Machines (NFM) [8] combines the successful Factorization Machines (FM) [13] with neural network architectures. Convolutional FM (CFM) [26] uses 3D convolutional neural networks to model high-order interactions between contextual variables.

2 PROPOSED APPROACH

We define a viewing event as a contextual setting, e.g. temporal aspects such as the time of day, together with consumed content, e.g. genre. Our goal is to learn joint embeddings of context and content, such that the representations of a true pair are close together, and those pairs that are unlikely to co-occur are far apart. As an example, when a child is watching, children’s content should be preferable to horror movies by having a smaller distance to the context in the latent space. Since the data collection method relies on implicit feedback, undesired correlations between content and context are generally not present in the data. As an example, there are no negative examples to indicate that children should not watch horror movies. We thus have to carefully consider how we formulate the learning objective and sampling strategy. For the remaining part of this work, we will present how JCCE is designed to overcome these challenges and achieve the goal described above.

Viewing events are logged as a pair $(I, C)$ of consumed content and contextual features, respectively. The specific data used in this paper is described in Section 3.1. Let $\phi_I(I)$ be an embedding of the content, where $\phi_I$ is the content encoding function shown by the red block in Fig. 1. Furthermore, let $\phi_C(C)$ be an embedding of the context, such that $\phi_I(I)$ and $\phi_C(C)$ have identical dimensionality. This allows the embeddings to exist in a shared latent space with the possibility to compute similarity scores.

2.1 Encoder Architecture

A contextual setting, $C$, is a sparse high-dimensional collection of numerical and categorical features describing the aspects of a viewing situation. These aspects include information about who is watching television as well as when and where it takes place. Thus, one of the tasks of the context encoder is to combine the various contextual aspects of a viewing event into a single point in the latent space such that the low-dimensional representation effectively embodies crucial characteristics of the situation at hand.

In order to embed vectors of context and content, we introduce the two encoders $\phi_I : \mathbb{R}^{|I|} \to \mathbb{R}^{E}$ and $\phi_C : \mathbb{R}^{|C|} \to \mathbb{R}^{E}$, where $E$ is the dimensionality of the embeddings. As in [24], we set $E = 50$ based on empirical findings, and do not investigate effects of changing $E$ further, but refer the reader to related literature, e.g. [27]. We employ nonlinear encoder networks consisting of three fully connected layers each³. The first two layers are each defined to have 250 rectified linear units (ReLUs), and the last layer is a simple linear transformation with 50 units.
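The encoder layout described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the input sizes (30 context features, 12 content features), the He-style weight initialization, and all function names are hypothetical and not taken from the paper, and dropout is omitted.

```python
import random


def init_layer(n_in, n_out, rng):
    """Create (weights, biases) for one fully connected layer."""
    # He-style scaling is an assumption; the paper does not state an initializer.
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Apply one fully connected layer, optionally followed by a ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def make_encoder(n_features, hidden=250, embed_dim=50, seed=0):
    """Three layers: 250 ReLU units, 250 ReLU units, 50 linear units."""
    rng = random.Random(seed)
    return [
        init_layer(n_features, hidden, rng),
        init_layer(hidden, hidden, rng),
        init_layer(hidden, embed_dim, rng),
    ]


def encode(x, encoder):
    """Map one feature vector to its E-dimensional embedding."""
    h = dense(x, encoder[0], relu=True)
    h = dense(h, encoder[1], relu=True)
    return dense(h, encoder[2], relu=False)  # final layer is purely linear


# Hypothetical input sizes: the two encoders may differ in input dimension
# but must share the embedding dimension E = 50.
context_encoder = make_encoder(n_features=30, seed=1)
content_encoder = make_encoder(n_features=12, seed=2)
context_embedding = encode([1.0] * 30, context_encoder)
content_embedding = encode([0.5] * 12, content_encoder)
```

Because both encoders end in a 50-unit linear layer, their outputs can be compared directly even though the inputs have different sizes.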

2.2 Jointly Training Context-Content Encoders

We follow a supervised approach for learning the weights of the two encoders simultaneously. Specifically, we use the N-pairs loss [18]:

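$$
\mathcal{L}_{NP}(\mathcal{X}, \mathcal{Y}) = -\frac{1}{N} \sum_{i}^{N} \log\left( \frac{e^{x_i^T y_i}}{\sum_{j}^{N} e^{x_i^T y_j}} \right) + \lambda \left( \lVert x_i \rVert_2 + \lVert y_i \rVert_2 \right), \tag{1}
$$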

where $N$ is the number of pairs in a mini-batch, $\lambda$ is the regularization strength, and $\mathcal{X} = \{x_1, \dots, x_N\}$ and $\mathcal{Y} = \{y_1, \dots, y_N\}$ are sets of $E$-dimensional vectors $x_i, y_i \in \mathbb{R}^{E}$ such that $(x_i, y_i)$ is a pair and $x_i \neq x_j$ for all $i, j = 1, \dots, N$ with $i \neq j$. Thus, each $x_i$ is compared to one positive example, $y_i$, and $N - 1$ negative examples, $y_{j \neq i}$.
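A minimal sketch of the N-pairs loss of Eq. (1) in plain Python may help make the roles of the positive and negative examples concrete. Two points are assumptions rather than statements from the paper: the regularization term is taken as the unsquared L2 norm and averaged over the pairs in the batch, and the function name `n_pairs_loss` is hypothetical.

```python
import math


def n_pairs_loss(xs, ys, lam=0.0):
    """N-pairs loss for paired embeddings xs[i] <-> ys[i].

    Each xs[i] is scored against every ys[j]; ys[i] is the positive
    example and the other N - 1 entries act as negatives.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        logits = [sum(a * b for a, b in zip(xs[i], ys[j])) for j in range(n)]
        # log-sum-exp with the maximum subtracted for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total -= logits[i] - log_denominator
        # Assumed form of the regularizer: unsquared L2 norms, one term per pair.
        norm_x = math.sqrt(sum(a * a for a in xs[i]))
        norm_y = math.sqrt(sum(b * b for b in ys[i]))
        total += lam * (norm_x + norm_y)
    return total / n


# Two orthogonal unit vectors paired with themselves.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [[1.0, 0.0], [0.0, 1.0]]
loss_plain = n_pairs_loss(xs, ys)
loss_regularized = n_pairs_loss(xs, ys, lam=0.1)

# With all-zero embeddings every candidate is equally likely, so the loss is log N.
loss_uninformative = n_pairs_loss([[0.0, 0.0]] * 3, [[0.0, 0.0]] * 3)
```

The first term is a softmax cross-entropy over the batch, which is why no explicit negative sampling is needed: the other pairs in the mini-batch supply the negatives.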

³ Note that the two encoders are not required to have identical architectures, but in this work we have opted for similar structures.

For the purpose of training $\phi_I$ and $\phi_C$ jointly, we define $N$ to be the number of context-content pairs in each mini-batch. The content samples follow the notation $\mathcal{I} = \{I_1, \dots, I_N\}$, and likewise $\mathcal{C} = \{C_1, \dots, C_N\}$ for context samples. Each pair $(I_i, C_i)$ is a viewing event observed in the training data, and the $N$ pairs in a mini-batch are selected such that $\mathcal{I}$ contains $N$ unique content types. These pairs are weighted equally, but a natural extension is to consider varying confidence levels, e.g. using the duration of the event. For efficacy, we define the JCCE training objective as minimizing the asymmetric N-pairs loss of the embeddings in both directions:

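$$
\mathcal{L}(\mathcal{I}, \mathcal{C}, \phi_I, \phi_C) = \mathcal{L}_{NP}(\phi_I(\mathcal{I}), \phi_C(\mathcal{C})) + \mathcal{L}_{NP}(\phi_C(\mathcal{C}), \phi_I(\mathcal{I})). \tag{2}
$$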

In addition to grouping context-content pairs, this process will also cluster 1) the contextual settings exhibiting similar preferences in terms of consumed content, and 2) the content types consumed in similar contexts. Each of these properties can be accessed independently using the respective encoding function, and ultimately allows us to generalize the model to relationships between unseen context-content pairs.
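The two-directional objective of Eq. (2) and the unique-content constraint on mini-batches can be sketched as below. The regularization term is left out for brevity, and the event representation as `(content_id, context)` tuples as well as the names `jcce_objective` and `sample_unique_content_batch` are hypothetical choices made for this illustration.

```python
import math
import random


def n_pairs_loss(xs, ys):
    """Softmax part of the N-pairs loss (regularization omitted here)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        logits = [sum(a * b for a, b in zip(xs[i], ys[j])) for j in range(n)]
        m = max(logits)
        total -= logits[i] - (m + math.log(sum(math.exp(l - m) for l in logits)))
    return total / n


def jcce_objective(content_embeddings, context_embeddings):
    """Sum of the N-pairs loss in both directions, as in Eq. (2)."""
    return (n_pairs_loss(content_embeddings, context_embeddings)
            + n_pairs_loss(context_embeddings, content_embeddings))


def sample_unique_content_batch(events, batch_size, rng):
    """Pick up to batch_size events such that no content id repeats."""
    shuffled = events[:]
    rng.shuffle(shuffled)
    batch, seen = [], set()
    for content_id, context in shuffled:
        if content_id not in seen:
            seen.add(content_id)
            batch.append((content_id, context))
        if len(batch) == batch_size:
            break
    return batch


# Toy viewing events: (content id, context label); both are placeholders.
events = [("news", "evening"), ("news", "morning"), ("children", "morning"),
          ("sport", "evening"), ("drama", "night"), ("children", "midday")]
batch = sample_unique_content_batch(events, batch_size=3, rng=random.Random(0))

content = [[1.0, 0.0], [0.0, 1.0]]
context = [[1.0, 0.0], [0.0, 1.0]]
objective = jcce_objective(content, context)
```

Requiring unique content within a batch ensures that no in-batch negative is actually the same content as the positive, which matches the condition $x_i \neq x_j$ in the definition of the loss.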

2.3 Recommendations

The framework allows us to recommend content based on the contextual settings of a specific viewing situation. That is, if we know the context, $C$, we can compute a score for some available content, $I$, according to the cosine similarity $S(I, C) = \phi_I(I)^T \phi_C(C) / (|\phi_I(I)|\,|\phi_C(C)|)$. The resulting recommendation is a list of all available content sorted by decreasing similarity score to the given viewing context.
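As a rough sketch of this ranking step, the snippet below scores precomputed content embeddings against one context embedding and sorts them. The two-dimensional embeddings and genre names are made-up toy values, and `recommend` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recommend(context_embedding, content_embeddings):
    """Return content names sorted by decreasing similarity to the context."""
    scores = {name: cosine_similarity(embedding, context_embedding)
              for name, embedding in content_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Content embeddings are static at serving time, so they can be stored once.
precomputed_content = {
    "children": [1.0, 0.1],
    "news": [0.0, 1.0],
    "horror": [-1.0, 0.2],
}
ranking = recommend([0.9, 0.2], precomputed_content)
```

At serving time only the context encoder has to be evaluated; the content side reduces to a lookup followed by the similarity computation.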

2.4 Linear JCCE

In addition to the standard JCCE described above, we also show the performance of linear JCCE (L-JCCE), a fully context-aware configuration that uses the loss defined in Eq. (2) with linear encoders, $\phi_I$ and $\phi_C$. Specifically, $\phi_I(I_i) = W_I I_i + b_I$ with $W_I \in \mathbb{R}^{E \times |I_i|}$ and similarly $\phi_C(C_i) = W_C C_i + b_C$ with $W_C \in \mathbb{R}^{E \times |C_i|}$.
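To illustrate the linear encoder and how its size compares with the three-layer encoder of Section 2.1, the sketch below counts trainable parameters for both. The input dimensionality of 100 features is a hypothetical value chosen for the example (the paper does not report it), and the helper names are likewise illustrative.

```python
def linear_encode(x, weights, bias):
    """Compute W x + b for one feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def linear_param_count(n_features, embed_dim=50):
    """Parameters of a single affine map from n_features to embed_dim."""
    return embed_dim * n_features + embed_dim


def mlp_param_count(n_features, hidden=250, embed_dim=50):
    """Parameters of the 250-250-50 fully connected encoder."""
    return ((n_features * hidden + hidden)
            + (hidden * hidden + hidden)
            + (hidden * embed_dim + embed_dim))


# Tiny worked example of the affine map with E = 2 and two input features.
embedding = linear_encode([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]], [0.0, 1.0])

n_features = 100  # hypothetical input size, not taken from the paper
linear_size = linear_param_count(n_features)
nonlinear_size = mlp_param_count(n_features)
```

Under this assumed input size the linear encoder has 5,050 parameters against 100,550 for the nonlinear one, which gives a sense of the capacity gap between the two configurations.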

3 EXPERIMENTS

In this section we conduct experiments with two main purposes: 1) quantitative evaluation of the recommender performance of JCCE and four baseline methods; 2) qualitative insights through visualizations of the embeddings learned by JCCE.

3.1 Setup

The proprietary dataset used for the experiments comprises approximately two months of television viewing within the BARB panel in the period June to July 2018. During that time frame, 5923 households encompassing 13K unique panel members reported at least once. We include 11 attributes of viewing events and remove all viewing with a duration of less than three minutes, since we can assume that users did not engage with the content if they watched it for less than that (a threshold also used in the official reach figures by BARB). We also remove viewing of content with few total observations, which reduces the number of viewing events from 4M to 2.7M. We separate the first 90% of events into a set for training and the remaining 10% into a set for evaluation. The test set covers approximately one week.

Television content distinguishes itself from popular recommender domains, such as movies, in several aspects. Most notably, television content catalogs are time-constrained and dynamic [20]. In the present contribution we focus on high-level recommendations of content genres, since these serve as robust descriptors that do not suffer to the same degree as specific programs from the rapidly changing catalog. The reduced BARB dataset contains 94 genres from 13 top-level genres, e.g. regional under news.

We train the model using the Adam optimizer [10] with early stopping, and apply dropout [19] to encourage contributions from more contextual variables and hence reduce overfitting to a limited selection of input features.

The evaluation is focused on the ability to recommend content (i.e. genres among the 94 available) given contextual settings. For this purpose we report hit ratio (HR@K) and mean reciprocal rank (MRR). That is, for each test sample a hit is achieved if the target content is within the top-K recommendations. The resulting HR@K is the hit ratio over all test samples. MRR is the average reciprocal placement of targets in the recommendation lists.
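The two metrics can be computed as in the following sketch, which assumes each test sample comes with a full ranked list of genres and a single target genre. The function names and the three toy rankings are illustrative only.

```python
def hit_ratio_at_k(rankings, targets, k):
    """Fraction of test samples whose target is within the top-k of its ranking."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)


def mean_reciprocal_rank(rankings, targets):
    """Average of 1 / (1-based position of the target) over all test samples."""
    reciprocal_ranks = [1.0 / (ranking.index(target) + 1)
                        for ranking, target in zip(rankings, targets)]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Three toy test samples; the target "news" is ranked 1st, 2nd, and 3rd.
rankings = [
    ["news", "sport", "drama"],
    ["drama", "news", "sport"],
    ["sport", "drama", "news"],
]
targets = ["news", "news", "news"]

hr_at_1 = hit_ratio_at_k(rankings, targets, 1)
hr_at_3 = hit_ratio_at_k(rankings, targets, 3)
mrr = mean_reciprocal_rank(rankings, targets)
```

In this toy case HR@1 is 1/3, HR@3 is 1, and MRR is (1 + 1/2 + 1/3) / 3, showing that MRR rewards near misses that HR@1 ignores.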

3.2 Baseline Methods

3.2.1 Random. A weak baseline that randomly ranks content and mainly serves as an indicator of the chances of coincidental hits.
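As a quick sanity check on what chance level looks like, the expected scores of a uniformly random ranking over the 94 genres can be worked out directly. The sketch assumes every genre is equally likely to land at any rank; the variable names are illustrative.

```python
n_genres = 94  # number of genres in the reduced dataset

# The target is equally likely to be at any of the 94 positions.
expected_hr_at_1 = 1 / n_genres
expected_hr_at_3 = 3 / n_genres
expected_mrr = sum(1 / rank for rank in range(1, n_genres + 1)) / n_genres

rounded = (round(expected_hr_at_1, 3), round(expected_hr_at_3, 3), round(expected_mrr, 3))
```

Rounded to three decimals these expectations are 0.011, 0.032, and 0.055, which agrees with the Random row reported in Table 1.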

3.2.2 Toppop. A context-agnostic baseline that ranks content according to the number of observations in the training set. It also serves as a measure of dominance among the most popular content compared to the less watched.

3.2.3 Toppop (temp). Temporal context is a key indicator due to the strong habitual preferences in everyday television consumption [20]. A simple, but often well-performing, baseline ranks content according to occurrences at specific temporal settings.
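The two popularity baselines can be sketched with simple counters. The representation of a temporal setting as a single slot label, the toy events, and the function names are assumptions made for illustration; the paper does not specify how temporal settings are bucketed for this baseline.

```python
from collections import Counter, defaultdict


def toppop(train_events):
    """Rank genres by their total number of observations in the training set."""
    counts = Counter(genre for _, genre in train_events)
    return [genre for genre, _ in counts.most_common()]


def toppop_temporal(train_events):
    """Rank genres separately for every temporal slot seen in training."""
    per_slot = defaultdict(Counter)
    for slot, genre in train_events:
        per_slot[slot][genre] += 1
    # Genres never observed in a slot are simply absent from that slot's ranking.
    return {slot: [genre for genre, _ in counts.most_common()]
            for slot, counts in per_slot.items()}


# Toy training events as (temporal slot, genre) pairs.
train_events = [
    ("morning", "children"), ("morning", "children"), ("morning", "news"),
    ("evening", "news"), ("evening", "news"), ("evening", "drama"),
    ("evening", "news"),
]
global_ranking = toppop(train_events)
temporal_ranking = toppop_temporal(train_events)
```

In the toy data the global ranking puts news first, while the morning slot puts children's content first, which is the kind of habitual pattern the temporal variant is meant to capture.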

3.2.4 Wide & Deep [3]. A state-of-the-art framework consisting of a wide component for memorization and a deep component for generalization. We use two cross-product transformations for the wide component: 1) user IDs and genre; 2) temporal settings (day of week, time of day) and genre. The deep component uses all features and the architecture is chosen to be similar to that of the JCCE encoders. The model is trained using a logistic loss with observed context-content pairs as positive examples and a similar number of randomly sampled unseen pairs as negative examples.

3.3 Results

Table 1 compares performances of the methods. Firstly, note the hit ratio scores of Toppop, suggesting that the most seen genre in the training set accounts for 8% in the test set, while the three most popular genres make up a total of 18.7% of the total viewing. There is a notable performance gain when taking temporal aspects into account in Toppop. It is also evident that L-JCCE is less accurate than its nonlinear counterpart, most likely due to a lower capacity both in terms of the number of parameters and the inability to model nonlinear interactions. JCCE outperforms Wide & Deep with relative improvements of 11%, 8.4%, and 8.3% for HR@1, HR@3, and MRR, respectively. The difference is partially explained by the training procedure, where JCCE utilizes the N-pairs strategy, while Wide & Deep relies on the point-wise logistic loss.

Table 1: Quantitative results.

Method           HR@1    HR@3    MRR
Random           0.011   0.032   0.055
Toppop           0.080   0.187   0.199
Toppop (temp)    0.120   0.282   0.262
L-JCCE           0.222   0.434   0.376
Wide & Deep      0.264   0.467   0.409
JCCE             0.293   0.506   0.443

[Figure 2 plot: HR@K (y-axis, 0.0 to 0.8) against K (x-axis, 1 to 15) for JCCE, Wide & Deep, L-JCCE, Toppop (temp), Toppop, and Random.]

Figure 2: Hit ratio performance at increasing K.
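The relative improvements of JCCE over Wide & Deep quoted in the results can be re-derived from the Table 1 scores with a few lines of arithmetic. The dictionary names are illustrative; the numbers are those reported in the table.

```python
# Scores taken from Table 1.
jcce = {"HR@1": 0.293, "HR@3": 0.506, "MRR": 0.443}
wide_and_deep = {"HR@1": 0.264, "HR@3": 0.467, "MRR": 0.409}

# Relative improvement in percent: 100 * (new - old) / old.
relative_gain = {
    metric: 100 * (jcce[metric] - wide_and_deep[metric]) / wide_and_deep[metric]
    for metric in jcce
}
rounded_gain = {metric: round(value, 1) for metric, value in relative_gain.items()}
```

Rounded to one decimal this gives 11.0% for HR@1, 8.4% for HR@3, and 8.3% for MRR, in line with the figures stated in the text.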

Fig. 2 shows how the methods perform in terms of hit ratio for different settings of K. As can be seen, JCCE shows consistent improvements compared to the other methods, which has been further inspected by conducting McNemar’s tests, verifying that all improvements are statistically significant with p < 0.001.
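For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) two-sided variant applied to paired hit/miss outcomes of two recommenders. The paper does not state which variant was used, so this choice is an assumption, and the discordant counts in the example are made up rather than taken from the experiments.

```python
import math


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: samples where only method A scores a hit.
    c: samples where only method B scores a hit.
    Samples where both or neither hit do not affect the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis the discordant samples split as Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up discordant counts for illustration only.
p_one_sided_split = mcnemar_exact_p(0, 10)
p_even_split = mcnemar_exact_p(5, 5)
```

The test only looks at samples on which the two methods disagree: ten disagreements all favoring one method give p = 2/1024, whereas an even five-to-five split gives p = 1.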

3.4 Analysis of Learned Representations

In this section, we take a qualitative look at how contextual settings and content group in the learned latent space. To this end, a large set of viewing events is randomly sampled from the test set, such that they are not seen during the training phase. We then compute their embeddings using the context encoder, $\phi_C$. In Fig. 3 we have used JCCE and reduced the embeddings from 50 to 2 dimensions for visualization with t-SNE [23] using cosine similarity. We also include content embeddings from each unique genre using $\phi_I$.

The three plots of Fig. 3 show the same embeddings colored according to three different variables of the viewing events. The first one, Fig. 3a, colors the context embeddings by the associated target content genre. Note e.g. the cluster of contexts with children’s content, and how news/weather and current affairs group together in the top-right, suggesting that they tend to be consumed in similar contextual settings. Furthermore, the content embeddings of these genres are located in the same areas. The second plot is related to the first, but instead of showing the ground-truth genre choice it displays the genre recommended by JCCE, and serves as a qualitative tool to assess the performance and types of errors made by the model. For an ideal RS, the plot in Fig. 3b would thus be similar to the one in Fig. 3a. The last plot, Fig. 3c, clearly demonstrates the strong temporal influence on the learned representations. Combined with more contextual variables, e.g. social, these embedding visualizations enable valuable insights, whether investigating patterns of context or content.

4 CONCLUSION

In this work, we explored deep embeddings learned jointly for context and content. We introduced the unified framework of JCCE that delivers context-aware recommendations, while also supplying tools for exploring patterns of context-content, context-context, and content-content relationships. We demonstrated the capability of JCCE by achieving superior performance for recommendations in the television domain, and inspected the interpretability by visualizing learned structures in the shared latent space. For future work we will explore the shared latent space, as well as evaluate the approach against more baseline methods on publicly available datasets suitable for CARS.

[Figure 3 plots: three t-SNE panels, (a) Observed genre, (b) Recommended genre, (c) Time of day. Genre legend for (a) and (b): Arts, Children, Current Affairs, Documentaries, Drama, Entertainment, Films, Hobbies/Leisure, Music, News/Weather, Other, Religious, Sport. Time-of-day legend for (c): Morning, Midday, Early Evening, Prime Evening, Late Evening, Night.]

Figure 3: Embeddings of the JCCE model visualized using t-SNE. The small points are context embeddings of randomly selected samples from the test set colored by (a) associated ground-truth genre, (b) genre that achieves the highest cosine similarity with the context in the 50-dimensional space, and (c) time of day. Each large point is a content embedding of a sub genre (e.g. History is a sub genre of Documentaries), and follows the color scheme of (a). The figures are zoomable.


REFERENCES

[1] Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering 17, 6 (June 2005), 734–749. https://doi.org/10.1109/TKDE.2005.99

[2] Allison J.B. Chaney, Mike Gartrell, Jake M. Hofman, John Guiver, Noam Koenigstein, Pushmeet Kohli, and Ulrich Paquet. 2014. A Large-Scale Exploration of Group Viewing Patterns. In Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video (TVX ’14). ACM, New York, NY, USA, 31–38. https://doi.org/10.1145/2602299.2602309

[3] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016). ACM, 7–10. https://doi.org/10.1145/2988450.2988454

[4] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 191–198. https://doi.org/10.1145/2959100.2959190

[5] Sandra Clara Gadanho and Nicolas Lhuillier. 2007. Addressing Uncertainty in Implicit Preferences. In Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys ’07). ACM, New York, NY, USA, 97–104. https://doi.org/10.1145/1297231.1297248

[6] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-Commerce in Your Inbox: Product Recommendations at Scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). ACM, New York, NY, USA, 1809–1818. https://doi.org/10.1145/2783258.2788627

[7] Cheng Guo and Felix Berkhahn. 2016. Entity Embeddings of Categorical Variables. arXiv:1604.06737 [cs] (April 2016). arXiv:cs/1604.06737

[8] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 173–182. https://doi.org/10.1145/3038912.3052569

[9] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative Metric Learning. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 193–201. https://doi.org/10.1145/3038912.3052639

[10] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs] (Dec. 2014). arXiv:cs/1412.6980

[11] Miklas S. Kristoffersen, Sven E. Shepstone, and Zheng-Hua Tan. 2018. The Importance of Context When Recommending TV Content: Dataset and Algorithms. arXiv:1808.00337 [cs, stat] (July 2018). arXiv:cs, stat/1808.00337

[12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs] (Jan. 2013). arXiv:cs/1301.3781

[13] S. Rendle. 2010. Factorization Machines. In 2010 IEEE International Conference on Data Mining. 995–1000. https://doi.org/10.1109/ICDM.2010.127

[14] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann Machines for Collaborative Filtering. In Proceedings of the 24th International Conference on Machine Learning (ICML ’07). ACM, New York, NY, USA, 791–798. https://doi.org/10.1145/1273496.1273596

[15] Christophe Senot, Dimitre Kostadinov, Makram Bouzid, Jérôme Picault, Armen Aghasaryan, and Cédric Bernier. 2010. Analysis of Strategies for Building Group Profiles. In Proceedings of the 18th International Conference on User Modeling, Adaptation, and Personalization (UMAP ’10). Springer Berlin Heidelberg, 40–51.

[16] Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep Crossing: Web-Scale Modeling Without Manually Crafted Combinatorial Features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 255–262. https://doi.org/10.1145/2939672.2939704

[17] Yue Shi, Martha Larson, and Alan Hanjalic. 2014. Collaborative Filtering Beyond the User-Item Matrix: A Survey of the State of the Art and Future Challenges. ACM Comput. Surv. 47, 1 (May 2014), 3:1–3:45. https://doi.org/10.1145/2556270

[18] Kihyuk Sohn. 2016. Improved Deep Metric Learning with Multi-Class N-Pair Loss Objective. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 1857–1865.

[19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15 (2014), 1929–1958.

[20] Roberto Turrin, Andrea Condorelli, Paolo Cremonesi, and Roberto Pagano. 2014. Time-Based TV Programs Prediction. In 1st Workshop on Recommender Systems for Television and Online Video at ACM RecSys. 7.

[21] Bartłomiej Twardowski. 2016. Modelling Contextual Information in Session-Aware Recommender Systems with Neural Networks. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 273–276. https://doi.org/10.1145/2959100.2959162

[22] Moshe Unger, Ariel Bar, Bracha Shapira, and Lior Rokach. 2016. Towards Latent Context-Aware Recommendation Systems. Knowledge-Based Systems 104 (July 2016), 165–178. https://doi.org/10.1016/j.knosys.2016.04.020

[23] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data Using T-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.

[24] Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 225–232. https://doi.org/10.1145/2959100.2959160

[25] Douglas Véras, Thiago Prota, Alysson Bispo, Ricardo Prudêncio, and Carlos Ferraz. 2015. A Literature Review of Recommender Systems in the Television Domain. Expert Systems with Applications 42, 22 (Dec. 2015), 9046–9076. https://doi.org/10.1016/j.eswa.2015.06.052

[26] Xin Xin, Bo Chen, Xiangnan He, Dong Wang, Yue Ding, and Joemon Jose. 2019. CFM: Convolutional Factorization Machines for Context-Aware Recommendation. In 28th International Joint Conference on Artificial Intelligence (IJCAI’19).

[27] Zi Yin and Yuanyuan Shen. 2018. On the Dimensionality of Word Embedding. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 887–898.

[28] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 52, 1 (Feb. 2019), 5:1–5:38. https://doi.org/10.1145/3285029