
CAREER: Structured Output Models of Recommendations, Activities, and Behavior: project summary

Predictive models of human behavior underlie many of the most important computing systems in science and industry. Examples abound that seek to suggest friendship links, recommend content, estimate reactions, and forecast activities. Across years of research, and dozens of high-profile applications, these systems almost invariably follow a common structure: (a) collect large volumes of historical activity traces; (b) identify some target variable to be predicted (what will be clicked? what will be liked? who will be befriended? etc.); and (c) train supervised learning approaches to predict this output as accurately as possible.

In this proposal, we seek to investigate a fundamentally new modality of recommender systems and personalized models that are capable of estimating rich, structured outputs. That is, rather than predicting clicks, likes, or star ratings, we can build systems that are capable of answering complex questions about content, estimating nuanced reactions in the form of text, or even designing the content itself in order to achieve a specific reaction.

Our plan to achieve these goals builds upon traditional recommender systems approaches (that estimate models of users and content from huge datasets of historical activities), as well as newly emerging methods for generative and structured output modeling. At the intersection of these ideas shall be a new suite of generative models whose outputs can be personalized to individual users, and equivalently, new forms of recommender systems that can generate rich, structured outputs.

We realize this contribution via three research thrusts. In each, we seek to extend work on generative and structured output modeling to allow for personalized prediction and recommendation. In our first thrust, we consider knowledge extraction and Q/A systems, where our goal is to answer questions that depend on subjective, personal experience. Next, we focus on personalized generative models of sequences and text, in order to estimate individuals’ nuanced reactions to content. Finally, we consider personalized Generative Adversarial models, which allow us to address the problem of personalized content creation and design, i.e., to suggest new content that would elicit a certain response, if it were created.

Intellectual merit. Methodologically, the above contribution extends each of the research areas upon which it builds: predictive models of human behavior (and in particular recommender systems), and generative models/structured output prediction (in particular, GANs, LSTMs, etc.). Closing the research gap between these areas enables a host of exciting new applications dealing with structured knowledge, text, and images, and places us at the forefront of a newly emerging research area.

Broader impact. Scientific: Our research can be used to develop models for a range of structured output tasks where the ‘user’ explains substantial variability in the data. Such applications are ubiquitous in domains ranging from linguistics to social science to personalized healthcare. Specifically, our research shall be supported through collaborations with the UCSD School of Medicine, to investigate systems for automated patient intake and triage, by predicting the severity of symptoms from automated Q/A dialogs (Thrust 1), and with the UCSD Cognitive Science Department to develop models of heartrate data, to build systems capable of designing personalized exercise routines (Thrust 2).

Industrial: Personalized, predictive models of human behavior are of broad industrial interest, as drivers for applications in e-Commerce, social media, and recommender systems (etc.). This proposal is supported by collaborations with , to study personalized models of artistic preferences, and , to study systems for personalized Visual Question Answering (Thrust 3).

Educational: We are developing the curriculum for UCSD’s new Data Science major and minor, and are creating new coursework that aims to expose graduate and undergraduate students to cutting-edge research. We also regularly organize community workshops, and are involved in mentorship activities aimed at fostering the retention and involvement of groups including URMs, high-schoolers, and engineering professionals.


Summary: We seek to build the next generation of personalized recommender systems that support predictions in the form of rich, structured outputs. We seek systems that are more interactive, by responding to complex queries; more contextual, by estimating detailed reactions to content; and more creative, by suggesting what content should be created in order to elicit a specific response. We combine ideas from personalized and context-aware recommender systems (the main research area of the PI’s lab [1–27]) with recent advances on generative and structured output modeling. Doing so allows us to build behavioral models and recommender systems whose outputs include answers to complex questions, text, or images—representing a fundamental advance over existing models that are limited to predicting simple numerical or categorical values. Building these models helps to advance traditional applications of recommender systems, and has broader applications wherever individuals interact with complex content, including medicine, e-commerce, and personalized health.

1 Introduction

Every day we interact with predictive systems that seek to model our behavior, monitor our activities, and make recommendations: Whom will we befriend [25]? What articles will we like [26]? Who influences us in our social network [19]? And do our activities change over time [23]? Models that answer such questions drive important real-world systems, and at the same time are of basic scientific interest to economists, linguists, and social scientists, among others.

Despite their success in a field with many important applications, the fundamental structure of recommender systems has not changed in many years: all of the above questions boil down to regression, classification, and retrieval. This limits the type of queries that can be addressed, and several important problems remain open. For example, methods based on regression and retrieval cannot estimate how a user would react to content (e.g. what would they say?), or what type of content should be created in order to achieve a certain reaction. Such questions are fundamentally more challenging, as they depend on generating complex, structured outputs and cannot be solved by existing predictive models.

[Figure 1 diagram: regions labeled (I) Structured inputs, (O) Structured outputs, and (R) Behavioral models & Recommender Systems, containing Classifiers (CNNs, etc.), Content & Context-Aware RecSys, Generative models (GANs, LSTMs, etc.), (Cond.) GANs, Encoder-Decoder RNNs, Q/A & Dialog Systems, and “Traditional” RecSys (CF & MF, BPR, etc.); this proposal: Rich-input, Rich-output Recommender Systems (T1: interactivity, T2: context, T3: creativity).]

Figure 1: We seek to leverage recent work on generative and structured output modeling (O) with recommender systems techniques (R), including those that make use of structured and multimodal inputs (I).

In this proposal, we seek to develop a new area of rich-input, rich-output recommender systems. Our research seeks to (a) facilitate finer-grained modeling of behavior; (b) generalize to a larger class of queries (both in terms of inputs and outputs); and (c) support the generation of content, closing a gap in the capability of existing approaches. This research is at the confluence of several emerging and established research areas (Fig. 1), including recommender systems (especially those that leverage structured inputs), as well as recent developments on generative modeling and structured output learning. Our research aims to be at the forefront of a newly emerging area that combines personalization with modern machine learning techniques, for data including structured knowledge [28–30], text [31–39], and images [40–43].

Research Plan. We propose an ambitious research program consisting of three thrusts to realize this contribution. In each thrust our goal is to show how notions like personalization and subjectivity can be combined with complex, structured output models, in order to build new models and drive transformative new applications. In Thrust 1, we focus on knowledge extraction and Q/A, where the notions of personalization and subjectivity allow us to answer questions that depend on subjective personal experience, and to develop systems that are more empathetic to people’s individual needs. In Thrust 2, we focus on personalized generative models of sequences and text (and in particular LSTMs), which allow us to estimate more nuanced reactions, by predicting people’s responses (in terms of the language they would use) to previously unseen content. And in Thrust 3, we consider personalized Generative Adversarial Networks (GANs), allowing us to address the problem of personalized content creation and design, i.e., to suggest new content that would elicit a certain response, if it were created.

Each thrust is associated with specific broader impacts, to be realized through established academic and industrial collaborations: In Thrust 1, we will study ‘subjective symptoms’ from clinical Q/A dialogs with the UCSD School of Medicine. In Thrust 2, we will study personalized models of heartrate and sequence data with the UCSD Cognitive Science Department. And in Thrust 3, we will study artistic preference models with , and personalized Visual Question Answering (VQA) with .



Prior work: ‘unstructured’ recommendation
  f : U × I → R         Regression: predicting opinions (ratings, clicks, etc.)        [23, 44, 45]
  f : U × I → I         Retrieval: predicting next actions, likely purchases, etc.
Preliminary work: multiple-output and heteroscedastic recommendation
  f : U × I → R × R+    Heteroscedastic regression: modeling uncertainty, etc.         [22]
  f : U × I → P(I)      Multiple-output regression: bundle generation, etc.            [16, 18]
  f : U → U × I         Barter and trade recommendation                                [16]
This proposal
  f : U × S∗ → S∗       Personalized Q/A and dialog systems                            [12, 15]   Thrust 1
  f : U × I → S∗        Estimating likely reactions and personalized text models       [20, 46]   Thrust 2
  f : U → I             Personalized content generation and design                     [9, 47]    Thrust 3

Table 1: Relationship between prior work, our preliminary work, and the current proposal (for a hypothetical fashion recommendation scenario). (U = user, I = item, S∗ = sequence/text, I = space of possible items; rows progress from simple to complex outputs.)
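To make the signatures in Table 1 concrete, the following is a purely illustrative sketch in Python type hints; the aliases and names are ours, not part of the proposal:

```python
from typing import Callable, List, Set, Tuple

# Illustrative stand-ins for the sets in Table 1 (U = users, I = items, S* = sequences/text).
User, Item, Token = str, str, str
Text = List[Token]  # S*: a sequence of tokens

# Prior work: 'unstructured' recommendation
rate: Callable[[User, Item], float]                                  # f : U x I -> R   (regression)
retrieve: Callable[[User, Item], Item]                               # f : U x I -> I   (retrieval)

# Preliminary work: multiple-output and heteroscedastic recommendation
rate_with_uncertainty: Callable[[User, Item], Tuple[float, float]]   # f : U x I -> R x R+
recommend_bundle: Callable[[User, Item], Set[Item]]                  # f : U x I -> P(I)
recommend_trade: Callable[[User], Tuple[User, Item]]                 # f : U -> U x I

# This proposal
answer: Callable[[User, Text], Text]                                 # f : U x S* -> S*  (Thrust 1)
react: Callable[[User, Item], Text]                                  # f : U x I -> S*   (Thrust 2)
design: Callable[[User], Item]                                       # f : U -> I        (Thrust 3)
```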

2 Background

Traditionally, behavioral models and recommender systems seek to predict simple signals (e.g. purchases, clicks, ratings), by training on historical data of the same form, in order to optimize regression or retrieval benchmarks. This includes classical methods like collaborative filtering (see e.g. the Netflix prize), as well as a swathe of modern techniques like BPR, Factorization Machines, etc. [44, 45, 48].

This line of work has been extended to incorporate rich side information in the form of text, sequences, images, graphs (etc.). Such side information provides context that can be used to improve prediction accuracy (especially in cold-start scenarios [49]), and to increase model interpretability [50]. A great deal of effort (including within our lab) has been devoted to building such models: using text to understand how people express their preferences [12, 15, 24], using sequential data to understand the context and evolution of people’s opinions [11, 23], and using images to understand the visual evolution of fashion trends and artistic preferences [9, 10, 13, 51]. Note however that while the inputs to such models are complex and high-dimensional, their outputs are still simple and low-dimensional.

In contrast, generative models of text, sequences, and images have recently gained wide popularity for applications such as image generation and captioning [52, 53], machine translation [54], and text synthesis [55, 56]. These systems have generated state-of-the-art results for a variety of tasks, and also suggest new applications with high-dimensional content as the system’s output. Variants of these models can be trained to generate outputs that are conditioned on an auxiliary input signal [40–43, 57], though using such models within the context of personalized recommender systems remains open.

Research gap and opportunity. Between these two lines of inquiry we see an opportunity to develop an exciting new research area, in which we can model activities, predict behavior, and make personalized recommendations, in the form of rich, structured outputs, including personalized answers to complex questions (Thrust 1), personalized textual reactions (Thrust 2), and personalized content (Thrust 3).

Our strategy to close this gap is depicted in Table 1: like existing recommender systems we seek to model complex interactions between ‘users’ and ‘content.’ By changing the relationships between the models’ inputs and outputs (Table 1, output-type column), and leveraging the power of newly emerging generative modeling approaches, we can fundamentally change the types of questions that can be addressed. Making this conceptually simple—though methodologically complex—change represents a transformative way of thinking about predictive behavioral models, and drives a suite of important new applications.

PI background & prior work. Our lab’s research mission is to develop predictive models of human behavior, and as such we have experience applying machine learning to semantically rich data, including textual, visual, relational, and temporal models. Examples include recommender systems for beers, wines, and consumer goods [3–5, 23, 24, 27], including socially and temporally-aware recommender systems [2, 11, 58]; friendship links and social communities [1, 25, 59]; reddit submissions [26]; images, art, and fashion [6–10, 13, 14, 60]; and question-answering systems [12, 15].

In the past year, we have begun investigating several forms of ‘structured output’ recommender systems, preliminary to the main thrusts of this proposal. For example, a simple form of a structured output recommender is a heteroscedastic regressor, i.e., a regressor that models both an output as well as its uncertainty about that output.


Thrust | Mission | Enabling technologies | Prelim. work | Short term goals and intellectual impact | Long term goals and broader impacts

1 | Make behavioral models more interactive by answering complex queries | Subjective knowledge extraction from crowds [29], MoE-based Q/A systems [12] | [12, 15] | Develop personalized and subjective ‘opinion bases’ and Q/A systems | Neurotological Q/A and dialog analysis for patient intake and triage (w/ UCSD School of Medicine)

2 | Make behavioral models more contextual by estimating nuanced reactions | LSTMs, encoder-decoder architectures, sequence-to-sequence models [54] | [24] | Develop personalized LSTMs and text models to estimate nuanced reactions to unseen content | Personalized sequence models for heartrate monitoring and urban sensor data (w/ UCSD CogSci Department)

3 | Make behavioral models more creative by generating new content | Siamese Nets [61], GANs [62, 63], conditional GANs [40–43, 57] | [9, 47] | Develop personalized Generative Adversarial Networks for content synthesis and design | Personalized VQA and artistic preference modeling (with and )

Timeline: enabling technologies and preliminary work, last 1–3 years =⇒ short term goals, next 2–4 years =⇒ long term goals, next 3–6 years.

Table 2: Short term goals, long term goals, broader impacts, and collaborators.

In [22], we used this form of regressor to suggest more efficient surgery schedules, showing that both the duration as well as our uncertainty about the duration varies due to the characteristics of the operation and the individual surgeon. We have also recently considered other forms of ‘simple’ structured recommendation, such as multiple-output recommendation and set recommendation. Examples include recommending trading partners in online bartering platforms (where we must recommend trading partners, as well as a pair of items they can mutually trade) [16], or recommending bundles of products to be purchased together [18] (where one has to consider issues such as compatibility and diversity). Although simpler than the types of complex outputs considered in this proposal, what our preliminary work demonstrates is: (1) there is a genuine research gap to be filled, in the sense that there are a variety of open problems and new applications at the intersection of structured output and predictive behavioral models, ranging from surgery scheduling [22] to DDR step-chart generation [20]; and (2) in addition to driving new applications, even ‘traditional’ objectives (like duration prediction, a regression task) can be quantitatively improved by considering structure in the output space.

3 Research trajectory

Our overall research trajectory is depicted in Table 2. As mentioned, each thrust is motivated by a specific application that extends structured output modeling with personalization. Each builds upon enabling technologies from prior work on knowledge extraction systems, generative text models such as LSTMs, and generative image models such as GANs. Our short term goals consist of developing specific extensions to these technologies to explain effects that arise due to user variance. Building on established technologies and metrics ensures the feasibility of our evaluation plan (Section 6).

While the ‘obvious’ applications of our research are to domains like recommender systems, opinion mining, and e-commerce, our focus on these applications is largely opportunistic, as these are domains for which large, labeled datasets are widely available. Our long term goals are concerned with exploring related applications where ‘users’ also exhibit substantial variance, such as clinical notes, heartrate data, etc. Realizing these goals is a multi-year process, due to the need to manage interdisciplinary collaborations, collect sensitive data, survey human subjects, etc.

Challenges, Feasibility, and Qualifications. Developing practical approaches to modeling human behavior from large scale, high-dimensional datasets presents a range of challenges: Methods must be scalable, as many of the datasets we consider (Table 9) span several terabytes of actions, text, and images. Moreover, such datasets are sparse and long-tailed when we consider the problem of modeling individuals. Thus we need models that trade off complexity with parsimony: generative models of text and images already require millions of parameters and lengthy training schedules, so we must extend them while adding only a modest number of parameters and computational overhead.

Our approach to these challenges incorporates ideas from recommender systems—where scalability and sparsity issues are rife—including dimensionality reduction and regularization techniques, into end-to-end learning systems for Q/A, text, and image data. Our lab has significant experience dealing with each of these domains (e.g. Q/A [12, 15], text [5, 24], and images [6–10, 13, 14]), including some of the largest-scale studies to date, spanning hundreds of millions of activities mined from sources including Amazon [5], Facebook [25], and Google [21]. Although we acknowledge the complexity of a research program consisting of several challenging tasks, each with its own academic and industrial partners, this is exactly the kind of cross-disciplinary, collaborative research at which our lab excels, having collaborated with 33 authors across 10 departments and institutions during the last two years alone.


4 Contributions and Impact

4.1 Intellectual Merit

Methodologically, our research contributes to each of the areas upon which it builds. To the area of generative modeling and structured output learning, our contribution is a new suite of techniques and approaches (‘personalized’ knowledge extraction, personalized text generation, personalized image generation, etc.) that can be used to model applications where the ‘user’ explains substantial variance in the data, due to personal preferences, biases, subjectivity (etc.). Such applications are ubiquitous within recommender systems and social media mining, but also emerge when modeling (for example) clinical notes (Thrust 1), heartrate data (Thrust 2), etc.

Figure 2: Thrusts, applications, and impact.

Our research also drives new applications in recommender systems and personalized modeling. Our goal is to transform the way systems are currently trained and evaluated, by re-imagining the notion of ‘content’ as more than just features to augment a system’s performance, but as desirable outputs that enhance interactivity, context-awareness, and creativity. The relationship between our contributions, in terms of methodology, applications, and broader impacts, is depicted in Fig. 2.

4.2 Broader Impacts

Health. Our collaboration with the UCSD School of Medicine shall investigate systems for patient intake and triage, by predicting the severity of symptoms from automated Q/A dialogs. The primary challenge to overcome is the subjectivity with which symptoms are reported, which we expect to benefit from personalized Q/A systems (Thrust 1). And with the UCSD Cognitive Science Department we shall investigate personalized models of heartrate data, which we cast as a structured output task for the purpose of recommending personalized exercise routines (Thrust 2). Broadly, for applications like personalized health, it is critical to deal with structured data, while simultaneously accounting for variance that arises due to individual behavioral patterns [22].

Commercialization. Broadly, work on personalized recommender systems has direct impact on countless industrial applications, such as e-Commerce [5], social media mining [26], and even video gaming [20]. Our lab has ongoing collaborations with industrial partners, to develop personalized visual models of artistic preferences (with ) and personalized Visual Question Answering Systems (with ). These collaborations are described in Thrust 3. Note that while our work is clearly of commercial interest, the goals of our proposal are basic-science contributions that differ from what is typically funded via industrial grants; rather, our industrial collaborators provide complementary expertise, and help to ensure that the potential impact of our work is realized.

Letters of Collaboration from the UCSD School of Medicine, , and are attached.

Education and Outreach. In the coming year we shall create an additional research-oriented class that closely follows the thrusts in this proposal, and develop the intro-level curriculum for UCSD’s upcoming Data Science major and minor. We are also involved in mentorship programs aimed at increasing the retention of URMs in CS, and other activities targeting high-schoolers and engineering professionals. We regularly organize community workshops, including the SoCal Machine Learning Symposium and the Workshop on Mining and Learning with Graphs.

Data. The dissemination of new datasets is an ongoing effort and major contribution of our lab. Although we have collected a wide selection of datasets that can already support the proposed work (Table 9), new data will be collected throughout this proposal. The PI has a track record of making all data and code publicly available; many of the datasets we have collected have become de facto standards in academia and industry, having been used in dozens of ML and data mining classes worldwide (including our own), and receiving thousands of downloads per month.


Product: Mommy’s Helper Kid Keeper. Question: “I have a big two year old (30lbs) who is very active and pretty strong. Will this harness fit him? Will there be any room to grow?” Sample responses: “So if you have big babies, this may not fit very long.”; “They fit my boys okay for now, but I was really hoping they would fit around their torso for longer.”; “I have a very active almost three year old who is huge.”

Product: Sanford EXPLORERV Vista Explorer 60” Tripod. Question: “Is this tripod better then the AmazonBasics 60-Inch Lightweight Tripod with Bag one?” Sample responses: “However, if you are looking for a steady tripod, this product is not the product that you are looking for”; “If you need a tripod for a camera or camcorder and are on a tight budget, this is the one for you.”; “This would probably work as a door stop at a gas station, but for any camera or spotting scope work I’d rather just lean over the hood of my pickup.”

Product: Thermos 16 Oz Stainless Steel. Question: “how many hours does it keep hot and cold?” Sample responses: “Does keep the coffee very hot for several hours.”; “Keeps hot Beverages hot for a long time.”; “I bought this to replace an aging one which was nearly identical to it on the outside, but which kept hot liquids hot for over 6 hours.”; “Simple, sleek design, keeps the coffee hot for hours, and that’s all I need.”; “I tested it by placing boiling hot water in it and it did not keep it hot for 10 hrs.”; “Overall, I found that it kept the water hot for about 3-4 hrs.”

Figure 3: Examples of community questions showing dependence on context (left), dependence on the querier (middle), and the inconsistency of opinions (right). From our prior work [12, 15].

Thrust 1: Answering Complex Questions: Personalized Knowledge Extraction and Q/A Systems

Introduction and problem statement. Extracting useful and actionable knowledge from massive volumes of unstructured data is a fundamental problem with applications in a variety of domains. Systems range from IBM’s Watson (used to answer Jeopardy questions [30]), to Microsoft’s system that analyzes search logs to predict adverse drug reactions [64], to academic systems that discover relationships in literature ranging from pharmacogenomics to paleontology [65].

Building such systems means addressing two basic problems: One consists of condensing huge volumes of unstructured or semi-structured data—especially text—into codified form, i.e., to build ‘knowledge bases’ of facts or evidence. The second problem consists of querying the knowledge base in a natural and interactive way, whether to identify the entities being referred to in a Jeopardy question, or to predict a drug reaction from patterns in search logs.

Existing work on knowledge base extraction generally assumes that there is some ‘truth’ to be discovered, and that deviation from this truth is due to noise or misinformation that must be corrected. However this may be too simple an assumption in the context of recommender systems, where questions are subjective and require personalized answers. Consider for example the questions in Fig. 3 (taken from our corpus of questions from Amazon [12]). In the first example (left), the user provides context describing their specific use case. In the second (center), the user asks a question that is highly subjective (asking which item is ‘better’); furthermore a good answer to this question should be personalized to the querier (are they an expert? on a budget? etc.). In the third (right), a seemingly objective question is asked, yet responses are remarkably inconsistent, reflecting a wide variety of personal experiences.

What the above suggests is the need to develop knowledge extraction and Q/A systems that account for issues such as user-dependent context, personalization, and subjectivity. Through this thrust we hope to achieve these goals by combining knowledge extraction and personalization techniques, via two research questions: first, to deal with the ‘subjective knowledge extraction’ problem (RQ1.1), and second, to use this knowledge to provide personalized answers to complex subjective questions (RQ1.2).

Thrust 1. How can the notions of personalization and subjectivity be used to improve the performance and interactivity of knowledge extraction and question/answering systems?

                  home & kitchen    health
Okapi BM25+            84.0%        81.7%
‘Factual’ Text         86.3%        84.2%
Subjective Text        90.7%        88.0%

Table 3: Prior work and motivation (from [12]). Performance on simple Q/A tasks can be improved with ‘subjective’ knowledge, such as that found in product reviews. Numbers shown are the fraction of times a correct answer is selected over an alternative on two Q/A corpora.

Prior and preliminary work. In [12, 15] we collected a dataset of 1.4 million questions (and answers), which we used to study the problem of question answering using consumer reviews. Our research raised several interesting attributes of real-world Q/A communities:
1. Real Q/A communities surface multiple inconsistent answers (we collected around four answers per question, on average) [15].
2. Even questions identified as being ‘objective’ by crowd workers yield inconsistent answers, based on differing personal experiences [12].
3. Traditional Q/A and information retrieval techniques perform poorly when confronted with nuanced and subjective questions [12, 15].
Most importantly, we found that substantial performance improvements can be achieved by making use of subjective information, such as product reviews and opinion data, rather than relying on sources of ‘factual’ information alone (Table 3).


While our prior work demonstrates the need for personalization and subjectivity within Q/A systems, the predictors were limited to binary Q/A problems or multi-class answer selection. Given our fundamental research goal, here we plan to extend such methods to extract structured knowledge and to answer complex questions interactively.

RQ1.1. Can we generalize knowledge bases and entity-relationship models by extracting ‘opinion bases’ that account for the range and subjectivity of personal experiences?

⟨ attribute, operator, comparison, preference score, subjectivity score, according to ⟩

“doesn’t taste as good as the new flavors”  →  ⟨ ‘taste’, ≤, ‘new flavors’, 4.0, 1.0, Kevin HG ⟩
“the chocolate is by far better”  →  ⟨ ‘taste’, ≤, ‘chocolate’, 3.0, 1.0, Enner ⟩
“tastes like elmers glue with a fruit loop aftertaste”  →  ⟨ ‘taste’, ∼, ‘elmers glue’, 2.0, 1.0, NC ⟩
“I had really bad hives-inducing allergic reactions to this product...I have since switched to Vega One powder, which is soy free, and have had no reactions”  →  ⟨ ‘allergenic’, ≥, ‘Vega One’, 1.0, 0.5, Cembalista ⟩

Table 4: Examples of the type of entity relationship to be extracted (from reviews of ‘Soylent Meal Replacement Drink’). Relationships may capture comparative judgments, and be associated with measurements reflecting their subjectivity.

First we will design entity-relationship models that account for subjective relationships, as well as the characteristics of the opinion holder (Table 4). This requires both (a) defining ontologies to describe the types of relations to be extracted, followed by (b) a crowdsourcing effort to extract ground-truth labels for training and evaluation. For the former we will adopt techniques from existing work on entity-relationship modeling [66, 67], while for the latter we will use techniques to resolve ambiguity [68] and mine subjective knowledge [29] from crowds.
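As a concrete illustration of the relation schema in Table 4, the tuples to be extracted might be represented as follows; the field names are ours and simply mirror the table’s columns:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OpinionRelation:
    """One extracted, subjective relation of the form shown in Table 4."""
    attribute: str             # e.g. 'taste', 'allergenic'
    operator: str              # comparison operator: '<=', '>=', '~', ...
    comparison: str            # the entity being compared against, e.g. 'new flavors'
    preference_score: float    # how strongly the opinion holder (dis)prefers the item
    subjectivity_score: float  # 0.0 (objective) .. 1.0 (fully subjective)
    according_to: str          # the opinion holder
    source_text: Optional[str] = None  # the review fragment the relation was mined from

# Example corresponding to the first row of Table 4:
rel = OpinionRelation(attribute='taste', operator='<=', comparison='new flavors',
                      preference_score=4.0, subjectivity_score=1.0,
                      according_to='Kevin HG',
                      source_text="doesn't taste as good as the new flavors")
```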

[Figure 4 diagram: entities embedded in a feature space are mapped into a relation space.]

Figure 4: We plan to extract relationships using embedding techniques, where entities (h, t) are related via embeddings Mr and translation vectors (see e.g. [66]).

The next task consists of developing techniques to automatically extract further entity relationships from our complete corpus, using supervised or semi-supervised knowledge-graph completion techniques. We plan to make use of embedding techniques for relationship extraction, following our previous work [12, 14, 21] and prior work for knowledge graph completion [66]. Here entities to be linked (e.g. two items being compared) are related via embeddings; using multiple embeddings [14] or translation vectors [21, 66] allows for complex non-metric relationships to be learned between entities (Fig. 4):

p(relation) = σ(h Mr t)   (bilinear model; [12, 15, 69])
p(relation) = σ(M_{r,0} h + M_{r,1} t)  or  σ(Mr h + Mr t + b)   (translation-based models; [14, 21, 66])
p(relation) = σ(Mr(u, s, p) h + Mr(u, s, p) t + b)   (personalized translation-based models; RQ1.1)

We have previously used similar embedding techniques to establish item-to-item relationships for recommendation scenarios [5, 14, 21]. Although these techniques can be applied as-is as baselines for entity extraction, our main contribution here shall be to extend these methods with parameterized embeddings Mr(u, s, p) that account for user latent factors (u), the subjectivity (s) of the statement in question, and other available features such as preference scores (p). We anticipate that accounting for such personalization and subjectivity factors shall lead to improved performance for entity relationship extraction, just as they did for simple Q/A tasks in [12, 15].
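A minimal sketch (PyTorch; module and parameter names are ours) of one possible parameterization of the personalized embeddings Mr(u, s, p): the relation projection is modulated by a gate computed from the user factors, subjectivity, and preference score. The gating scheme is an assumption for illustration, not the proposal’s final design.

```python
import torch
import torch.nn as nn

class PersonalizedRelationScorer(nn.Module):
    """Score p(relation | h, t) with a relation projection modulated by (u, s, p)."""

    def __init__(self, n_entities, n_users, dim=32):
        super().__init__()
        self.entity = nn.Embedding(n_entities, dim)   # embeddings for entities h, t
        self.user = nn.Embedding(n_users, dim)        # user latent factors u
        self.M_r = nn.Linear(dim, dim, bias=False)    # base relation projection (cf. [66])
        self.modulate = nn.Linear(dim + 2, dim)       # maps (u, s, p) to a per-example gate
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, h_idx, t_idx, u_idx, subjectivity, preference):
        h, t = self.entity(h_idx), self.entity(t_idx)
        u = self.user(u_idx)
        ctx = torch.cat([u, subjectivity.unsqueeze(-1), preference.unsqueeze(-1)], dim=-1)
        gate = torch.sigmoid(self.modulate(ctx))      # personalized modulation of M_r
        proj = lambda x: gate * self.M_r(x)           # plays the role of M_r(u, s, p)
        return torch.sigmoid((proj(h) + proj(t)).sum(-1) + self.bias)

# toy usage
model = PersonalizedRelationScorer(n_entities=1000, n_users=500)
score = model(torch.tensor([3]), torch.tensor([7]), torch.tensor([42]),
              torch.tensor([1.0]), torch.tensor([4.0]))
```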

RQ1.2. How can the subjective and personalized knowledge bases developed in RQ1.1 be used to build more personalized and interactive Q/A systems?

Having built a labeled dataset and used it to extract complex and subjective relationships, we must next use the extracted information to build structured question answering systems. In our prior work we built systems for answering binary questions and multiclass answer selection [12, 15]. To do so we made use of mixture-of-experts-type frameworks to answer (binary) questions using bilinear models:

p(answer is ‘yes’ | q) = ∑_{f ∈ text fragments} p(f is relevant | q) · p(y | f is relevant, q) = ∑_f σ(f M q^T) · σ(f M′ q^T).   (1)

Note that this framework from our prior work ignores the knowledge extraction task (RQ1.1) altogether, and instead tries to identify significant fragments via a latent relevance model; this proves effective for simple questions (Eq. (1) is a binary classifier), but does not generalize to generating structured outputs.
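For reference, a minimal numpy sketch of the bilinear mixture-of-experts scorer in Eq. (1), under the assumption that the question q and the candidate text fragments are already embedded as fixed-length vectors (the sum is taken exactly as written above, with no normalization over fragments):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_answer_is_yes(q, fragments, M, M_prime):
    """Eq. (1): sum over fragments of relevance(f, q) * vote(f, q).

    q:          (d,)   embedded question
    fragments:  (n, d) embedded candidate text fragments
    M, M_prime: (d, d) bilinear parameters for relevance and answer polarity
    """
    relevance = sigmoid(fragments @ M @ q)        # sigma(f M q^T): is f relevant to q?
    polarity  = sigmoid(fragments @ M_prime @ q)  # sigma(f M' q^T): does f vote 'yes'?
    return float(np.sum(relevance * polarity))

# toy usage with random embeddings
d, n = 16, 5
rng = np.random.default_rng(0)
score = p_answer_is_yes(rng.normal(size=d), rng.normal(size=(n, d)),
                        0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d)))
```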

Here we plan to (a) improve the performance of these systems using the knowledge bases extracted in RQ1.1; (b) use structured knowledge from RQ1.1 to answer complex and open-ended questions; and (c) adapt Eq. (1) so that answers are tailored to the preferences of the querier. For (a) we can adapt the same framework from Eq. (1) to make use of structured knowledge (Table 4) rather than raw text; here our goal is to demonstrate that Q/A performance for subjective questions can be improved by making use of structured knowledge rather than free text alone. For (b) we shall make use of question analysis and question classification techniques similar to those already developed [70]. This consists of converting a question into a query in our knowledge base. We will initially start with simple and predefined sets of queries (e.g. to mine comparisons [71] and compatibility relationships [72]), before considering more complex and arbitrary relationships. For (c) we need to further parameterize the relevance models (M) to include characteristics of the user entering the query (i.e., to discover what relations are true for you).


Broader impacts: Q/A Systems for Patient Intake and Triage

Although datasets such as the one we have collected from Amazon can already be used for training and evaluation, to increase the broader impacts of this proposal we are also developing structured Q/A systems for the specific problem of automated patient intake and triage (Fig. 5).

Figure 5: (Synthetic) clinical Q/A dialog.

Together with (Professor of Medicine, UCSD), and (a practicing physician), we are working with the UCSD Clinical and Translational Research Institute (CTRI) and the Institute of Engineering in Medicine (IEM) on a pilot study that will collect Q/A dialogs for neurotology patient intake (neurological disorders of the ear, Fig. 5). During the next 18 months, this study will collect Q/A data and medical notes from real patients, and develop initial prototypes. Note that the goal here is not to automatically answer questions, but rather to estimate whether a particular Q/A dialog corresponds to a high-priority case (e.g. whether it is benign or refers to an emergency situation), which can be used for automated triage. We expect that this type of analysis will benefit from systems that analyze text in a personalized way, given the subjectivity of how symptoms are reported.

Evaluation. Detailed evaluation plans for all thrusts are included in Section 6.

Thrust 2: Predicting Complex Reactions: Personalized Generative Models of Text and Sequences

Introduction and problem statement. While existing personalized models focus on predicting what content users like or dislike, a more personalized and contextual system might tell users why they like some content, for example by estimating what they would say (i.e., the text they would use) in reaction to it. Such a system could be used to make more nuanced recommendations (by telling users both what they will like and why), or be used by content creators and marketers to forecast reactions in detail.

Traditionally, ‘context’ is added to personalized models by training systems that are ‘context-aware’ [73–76] or ‘aspect-aware’ [77–81], e.g. to disentangle opinions into multiple dimensions. Again though, such systems treat context as input to the model, and still focus on predicting simple outputs.

In this thrust we consider whether it is possible to estimate detailed reactions to content, in the form of text that a user might write in response to a given stimulus. For example, we might estimate what review a user would write for a product they haven’t reviewed yet (Fig. 6), or what specific aspects they might focus on, etc. This is a challenging language modeling task, as we must estimate (a) what types of language are used to describe the item; (b) what types of words the user tends to favor; and (c) what words describe the interaction between the user and the item (e.g. the user’s preference). These issues are further exacerbated by the need to work with incredibly sparse datasets, and the need to generate long sequences, a particularly challenging problem for generative language models [39].

Thrust 2. How can we adapt generative models of text and sequences to explain the variance that arises due to subjective and personal factors, by predicting how an individual would respond to a certain stimulus?

Mosstrooper’s predicted review for Shocktop Belgian White: Poured from 12oz bottle into half-liter Pilsner Urquell branded pilsner glass. Appearance: Pours a cloudy golden-orange color with a small, quickly dissipating white head that leaves a bit of lace behind. Smell: Smells HEAVILY of citrus. By heavily, I mean that this smells like kitchen cleaner with added wheat (contd.)...

Actual review from test set: Poured from a 12oz bottle into a 16oz Samuel Adams Perfect Pint glass. Appearance: Very pale golden color with a thin, white head that leaves little lacing. Smell: Very mild and inoffensive aromas of citrus. Taste: Starts with the same tastes of the citrus and fruit flavors of orange and lemon and the orange taste is all there (contd.)...

Figure 6: Preliminary experiments. Example of a generated (i.e., predicted) review for a particular user/item combination (using a Generative Concatenative Network, Fig. 7e), along with its ground truth from the test set.

Prior and preliminary work. Our work builds generally upon the areas of content- and context-aware recommender systems [73–76], and more specifically on ‘aspect-aware’ sentiment analysis, itself a simple form of ‘structured output’ recommendation that aims to estimate fine-grained reactions in terms of a fixed set of aspects [77–81].

Generating text has long been cast in terms of trying to predict the next token (a word or character) given those seen so far [82]. Potential models to build upon include [83] (which seeks to generate Wikipedia articles), as well as sequence-to-sequence models and encoder-decoder architectures [54]. Such models have been used to model reviews (and ‘tips’), though typically not with a focus on personalized generation [31, 32, 36, 38, 39].

Our own work on this topic dates back to [24], where we showed that recommendation performance can be improved by combining traditional recommender systems with topic models.


This has been extended to use various language models, including CNNs [38] and LSTMs [32, 36, 84] (as we propose to use here), though again these models struggle to generate outputs in a personalized way.

RQ2.1. How can generative text models be adapted to capture user-, content-, and interaction-specific language, in order to estimate detailed textual reactions to content?

                          Perplexity
                        Mean    Median
Unsupervised (Fig. 7b)   4.23     2.22
Personalized (Fig. 7e)   2.25     1.98

Table 5: Preliminary experiments. Combining language models with recommendation techniques leads to more accurate and personalized models of language.

Two of the main challenges in generating personalized responses are (a) sparsity, i.e., learning the variance across people’s writing styles based on a limited number of observations, and (b) generating long and structured sequences. For the former it would seem that the framework of encoder-decoder RNNs would be appropriate (Fig. 7c,d) [54]; this type of model has been used (for example) to caption images by first learning an image representation which is passed as input to an LSTM [53]; similarly, we might learn user and item representations as input to an RNN (Fig. 7d). However, preliminary experiments showed this to be ineffective when generating long sequences [46, 85], as the model quickly ‘forgets’ its input and saturates to a background (i.e., non-personalized) model.

An effective way to encourage the model to ‘remember’ its input/context is simply to replicate user and item representations at each step (via one-hot encodings), from which the model can learn interactions between the two (which we term a ‘Generative Concatenative Network,’ or GCN; Fig. 7e). This technique was used to generate the example in Fig. 6: this is a promising initial result, as the model is able to capture both the high-level structure of the review, as well as estimating the user’s subjective response to its attributes. This method also captures language more accurately in a quantitative sense (Table 5).
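Purely as an illustration of the replication idea (not the exact model from [46]), a minimal PyTorch sketch of a GCN-style conditional language model; here low-dimensional user/item embeddings stand in for the one-hot encodings (anticipating the factorized variant of Fig. 7f), and all names and hyperparameters are ours:

```python
import torch
import torch.nn as nn

class GenerativeConcatenativeNet(nn.Module):
    """Token-level LM whose input at *every* step is [token emb ; user emb ; item emb]."""

    def __init__(self, vocab_size, n_users, n_items, tok_dim=64, uid_dim=16, hidden=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, tok_dim)
        self.user = nn.Embedding(n_users, uid_dim)    # low-dimensional instead of one-hot
        self.item = nn.Embedding(n_items, uid_dim)
        self.lstm = nn.LSTM(tok_dim + 2 * uid_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, u_idx, i_idx):
        # tokens: (B, T) token ids; u_idx, i_idx: (B,) user/item ids
        B, T = tokens.shape
        ctx = torch.cat([self.user(u_idx), self.item(i_idx)], dim=-1)  # (B, 2*uid_dim)
        ctx = ctx.unsqueeze(1).expand(B, T, -1)                        # replicated at each step
        x = torch.cat([self.tok(tokens), ctx], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                                             # (B, T, vocab) logits

# teacher-forced training step on a toy batch
model = GenerativeConcatenativeNet(vocab_size=100, n_users=10, n_items=20)
tokens = torch.randint(0, 100, (4, 12))
logits = model(tokens[:, :-1], torch.tensor([0, 1, 2, 3]), torch.tensor([5, 6, 7, 8]))
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), tokens[:, 1:].reshape(-1))
```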

Figure 7: Selection of model architectures to be developed and points of comparison to be considered. Top left to bottom right: (a) Latent factor recommender system; (b) LSTM; (c) Encoder-Decoder LSTM; (d) Encoder-Decoder architecture for reaction estimation; (e) Generative Concatenative Network (GCN) [46]. Bottom row: (f) Factorized GCN; (g) Semi-Supervised Factorized GCN; (h) Conditional GCN for sequence modeling.

However, there are three limitations that we seek to overcome to make such models effective: (a) The model requires several days of training (given a modest corpus of ∼200,000 documents); this is largely an issue of dimensionality that arises due to one-hot encodings of users and items (which cannot scale beyond a few thousand users/items); we plan to address this by investigating models that learn low-dimensional user/item representations that are passed as input to the LSTM at each step (Fig. 7f). (b) The model requires substantial amounts of observations per user and item (on the order of 100 documents) to learn effective personalized representations; this is not always realistic in sparse datasets. We plan to address this by investigating ‘semi-supervised’ learning approaches that learn user/item representations from more abundant sources of data (ratings, clicks, etc.) in addition to a limited number of documents (Fig. 7g). And (c) a generative language model alone is not yet effective as a means of making predictions and recommendations; we address this in RQ2.2. A selection of possible architectures to be compared to achieve these goals is shown in Fig. 7.

RQ2.2. How can the personalized language models from RQ2.1 be extended to make effective predictions and recommendations, and applied to other forms of sequence data?

Our prior work [24] showed that combining language models with recommender systems leads to more accurate performance (in terms of predicting ratings), as we can learn users’ preferences more efficiently from review content than we could from preference data alone (which makes sense, as reviews are intended to ‘explain’ the dimensions of people’s preferences). Our first goal with this research question is to show that the same is true for the Generative Concatenative models developed in RQ2.1, which we plan to achieve by investigating joint training schemes for ‘BPR-like’ activity prediction models and GCNs. For example, a trivial formulation might follow the training approach we introduced in [24]:

∑_{u,i,i′} log σ(x_{u,i}(γ_u, γ_i) − x_{u,i′}(γ_u, γ_{i′})) + µ log perplexity(corpus | γ_u, γ_i),   (2)

where the first term is the activity model (following the BPR training scheme) and the second term is the language model (GCN or similar).
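A rough sketch of how an Eq. (2)-style joint objective could be computed (PyTorch); the scorer x_{u,i} = ⟨γ_u, γ_i⟩, the sign convention (we minimize), and the weighting are our assumptions rather than the final formulation:

```python
import torch
import torch.nn.functional as F

def joint_loss(gamma_u, gamma_i, gamma_j, lm_logits, lm_targets, mu=0.1):
    """Joint BPR-style activity loss plus language-model loss (to be minimized).

    gamma_u, gamma_i, gamma_j: (B, d) shared latent factors for user, observed item i,
                               and a sampled negative item i'.
    lm_logits:  (B, T, V) next-token logits from a GCN conditioned on (gamma_u, gamma_i).
    lm_targets: (B, T)    ground-truth tokens of the user's review of item i.
    """
    # activity model: pairwise ranking on x_ui = <gamma_u, gamma_i>
    x_ui = (gamma_u * gamma_i).sum(-1)
    x_uj = (gamma_u * gamma_j).sum(-1)
    bpr = -F.logsigmoid(x_ui - x_uj).mean()

    # language model: token-level negative log-likelihood (log-perplexity up to a constant)
    nll = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)), lm_targets.reshape(-1))

    return bpr + mu * nll
```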


Second, we will evaluate our models in terms of novel recommendation objectives. In particular we are interested in showing that GCN language models can be used for personalized search and retrieval, and to more effectively navigate existing content. For example, a user might describe the desirable attributes of an item (e.g. ‘good suit to wear to a wedding’), and the system would surface the items about which the user would be most likely to make such a statement; this connects to our goals from Thrust 1, as doing so provides an additional method to solve retrieval tasks in terms of personalized and subjective language. We also plan to investigate how our approach can be used for personalized review summarization and aspect-based discovery, following ideas from [86–92].
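Assuming a trained conditional language model such as the GCN sketched earlier, one way this personalized retrieval could be realized is to rank candidate items by the likelihood the model assigns to the user’s query text; a minimal sketch (names ours):

```python
import torch
import torch.nn.functional as F

def rank_items_by_query(model, query_tokens, u_idx, candidate_items):
    """Rank candidate item ids by log p(query | user, item) under a conditional LM
    such as the GCN sketched above (logits = model(tokens, u_idx, i_idx))."""
    inp = query_tokens[:-1].unsqueeze(0)   # teacher-forced input
    tgt = query_tokens[1:].unsqueeze(0)    # next-token targets
    scores = []
    with torch.no_grad():
        for i in candidate_items:
            logits = model(inp, u_idx.view(1), torch.tensor([i]))
            nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
            scores.append((-nll.item(), i))  # higher = user more likely to say this about item i
    return [i for _, i in sorted(scores, reverse=True)]
```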

Broader impacts: Personalized Models of Heartrate Data for Exercise Recommendation

Beyond learning personalized generative models of text, we plan to apply the same technology to other forms of sequential data. We will collaborate with (through a co-advised student, ) to use Generative Concatenative Networks to build personalized models of heartrate data, using a large dataset of workout histories (including GPS, heartrate logs, weather, etc.) collected from EndoMondo. We cast this as a sequence-to-sequence prediction task (see Fig. 7h) that aims to model the heartrate sequence a user will exhibit given (GPS data of) a workout to be undertaken. We will use models of this data to build recommender systems that can suggest activities and design exercise schedules that will help individuals to achieve personalized fitness goals.
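As an illustration of this casting (and not the collaboration’s actual model), a personalized sequence predictor might concatenate a user embedding with per-timestep route features and regress the heartrate at each step; everything below is an assumed, simplified setup:

```python
import torch
import torch.nn as nn

class HeartrateSeq2Seq(nn.Module):
    """Predict a heartrate sequence from per-timestep workout features, personalized by user."""

    def __init__(self, n_users, feat_dim=4, uid_dim=16, hidden=128):
        super().__init__()
        self.user = nn.Embedding(n_users, uid_dim)
        self.lstm = nn.LSTM(feat_dim + uid_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # heartrate (bpm) at each timestep

    def forward(self, route_feats, u_idx):
        # route_feats: (B, T, feat_dim), e.g. altitude, speed, grade, distance-so-far
        B, T, _ = route_feats.shape
        u = self.user(u_idx).unsqueeze(1).expand(B, T, -1)
        h, _ = self.lstm(torch.cat([route_feats, u], dim=-1))
        return self.head(h).squeeze(-1)    # (B, T) predicted heartrate

model = HeartrateSeq2Seq(n_users=100)
pred = model(torch.randn(2, 50, 4), torch.tensor([3, 7]))   # two workouts, 50 timesteps each
loss = nn.functional.mse_loss(pred, torch.randn(2, 50))     # against logged heartrate
```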

Thrust 3: Generating Complex Content: Personalized Design and Generative Adversarial Recommendation

Introduction and problem statement. Our final thrust of research consists of studying how personalization and recommendation can be combined with Generative Adversarial models. Our main goal with this thrust is to design new systems that can facilitate the personalized generation of new content, rather than merely retrieving content that already exists.

We consider this problem within the context of ‘visually-aware’ recommendation. This allows us to combine our own prior work on visually-aware recommender systems [7–11, 14, 19] with recent developments on comparative image models [7, 61] and generative adversarial models of images [62, 63]. Again, our goal is to show that an important class of existing structured output models can be extended and improved by incorporating personalization and behavioral modeling techniques.

Such methods have immediate applications to domains like fashion recommendation and design (for example), but more broadly to any domain where subjective visual features guide people’s decisions. This research has the potential to (a) lead to more accurate personalized retrieval techniques, by better understanding the visual dimensions that guide people’s decisions; (b) develop more expressive generative models that can be conditioned on subjective factors; and (c) take a first step toward building systems that can be used to aid the (personalized) design of new content.

Thrust 3. How can personalization be used in the generative design of content, i.e., to suggest the characteristics of content that users would interact with, if it were created?

Prior and preliminary work. Recently, we have shown that recommender systems can perform significantly more accurately when made to be ‘visually aware,’ especially in sparse datasets or ‘cold-start’ settings [9, 10] (VBPR, Table 6). So far, we have built systems that incorporate visual features ‘off the shelf,’ in concert with content-aware recommendation approaches [9]. We have shown that such techniques can be used to recommend clothing and outfits [6, 9], to understand the evolution of styles over time [10], and to understand the visual and social factors that influence people’s artistic preferences [13].

                        BPR [45]   VBPR [9]   DVBPR (RQ3.1)
Amazon fashion            0.628      0.748       0.796
Amazon fashion (cold)     0.551      0.732       0.772
Tradesy.com               0.586      0.750       0.786
Tradesy.com (cold)        0.542      0.753       0.779

Table 6: Prior and preliminary results (in terms of AUC). In [9] we showed that improved results can be obtained by incorporating visual features into content-aware recommenders (VBPR); in [10] we showed that the same techniques can be used to study fashion trends over time.

In parallel, there are two lines of research from computer vision that we seek to extend via behavioral modeling techniques. The first is Siamese CNNs [61]: these are discriminative models that can be trained to make comparisons between images using two ‘copies’ of the same CNN. This can be used (for example) for face verification [93], or to determine whether two items belong to the same style [7], though not in a personalized way. Two recent works suggest that such models could be used to develop better personalized recommender systems: in our own work [7] and in [94], Siamese CNNs are used to identify visually compatible items, and in [95] comparative judgments between images are modeled, which is a similar problem to personalized recommendation.


[Figure 8 diagram: architectures built from a pretrained network, an item model, and a user model, panels (a)–(g).]

Figure 8: Models to be developed and evaluated in Thrust 3. We seek to design personalized models for content creation and design. A selection of possible architectures (including from prior work) is shown above. From left to right: (a) (non-personalized) Siamese network [61]; (b) pre-trained network within content-aware RecSys (VBPR, [97]); (c) Deep VBPR (RQ3.1); (d) retrieval with DVBPR; (e) discriminator; (f) personalized GAN (RQ3.2); (g) joint GAN/recommender.

A second line of work consists of models for image generation, and in particular Generative Adversarial Networks (GANs). These are unsupervised learning frameworks in which two components 'compete' to generate realistic-looking outputs [62, 63]: a generator is trained to generate images, while a discriminator is trained to distinguish real versus generated images. Thus the generated images look 'realistic' in the sense that they are indistinguishable from those in the training set. Such systems can also be conditioned on additional inputs, in order to sample outputs with certain characteristics [40–43, 57].
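For reference, a minimal sketch of one adversarial training step under this framework (PyTorch; G, D, the optimizers, and the assumption that D outputs probabilities are generic placeholders, not the specific models of [62, 63]):

```python
# Sketch of one adversarial training step: the discriminator D learns to
# separate real from generated images, and the generator G learns to fool D.
# G, D, and the optimizers are generic placeholders; D is assumed to output
# a probability in (0, 1) for each image.
import torch
import torch.nn.functional as F

def gan_step(G, D, real_images, opt_g, opt_d, z_dim=100):
    batch = real_images.size(0)
    z = torch.randn(batch, z_dim)

    # Discriminator update: push D(real) toward 1 and D(G(z)) toward 0.
    fake = G(z).detach()
    d_loss = (F.binary_cross_entropy(D(real_images), torch.ones(batch, 1)) +
              F.binary_cross_entropy(D(fake), torch.zeros(batch, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: push D(G(z)) toward 1, i.e. make samples look real.
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```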

We seek to combine these lines of work through two technical contributions. First, we shall investigate recommender systems that can handle high-dimensional image content in an end-to-end fashion, following a recent trend of incorporating representation learning techniques into recommender systems [96]. Having developed these representations, we seek to use our models generatively, so that rich content is not just an input to the system but also an output.

RQ3.1. How can image representation and generation techniques be personalized, and conversely how can we extend systems for content-aware recommendation to make use of end-to-end image representation techniques?

Our first task is to investigate end-to-end visually-aware models for personalized recommendation, extending our prior work that used features 'off-the-shelf' within content-aware recommender systems. Learning such personalized representations is necessary before we attempt to synthesize new items in RQ3.2, and we expect that doing so will also lead to improved predictive performance.

Σ_{u,i,i'} ℓ_σ(x_{u,i} - x_{u,i'})  |  BPR [45]
x_{u,i} = α + β_u + β_i + γ_u^T γ_i (latent user/item preference) + θ_u^T (E f_i) (visual user/item preference)  |  VBPR [9]  |  Fig. 8b
x_{u,i} = α + β_u + θ_u^T Φ(X_i)  |  'Deep' VBPR (RQ3.1)  |  Fig. 8c
argmax_{e ∈ X} θ_u^T Φ(e)  |  retrieval (RQ3.1)  |  Fig. 8d
argmax_{e ∈ G(·)} θ_u^T Φ(e) - η [D(e) - 1]^2  |  synthesis (RQ3.2)  |  Fig. 8e-g

Table 7: Selection of formulations to be developed and compared for Thrust 3.

Our prior work made use of the BPR framework [45], in which comparative judgments are modeled between users and content by fitting a function x_{u,i} such that items i the user u prefers (likes, purchases, etc.) are associated with a higher score than other items i'. Training consists of maximizing pairwise differences (Table 7, row 1). x_{u,i} can be any function; e.g., our prior work made use of compatibility functions that incorporate image features from a (pre-trained) CNN (Table 7, row 2) [9].

We shall investigate a variety of end-to-end representations (including those in Fig. 8) in order to determine which are best for preference learning, and which can most effectively be adapted to recommendation tasks. The idea here is that both image representations (Φ(·)) and users' preferences toward them (θ_u) are learned jointly (Table 7, row 3). Our preliminary results (Table 6) show that even a trivial architecture (a 'vanilla' Siamese CNN with a user preference term, Fig. 8c) already improves results over pre-trained models. Following this, RQ3.2 seeks methods for GAN training that are capable of synthesizing items that look realistic while targeting the preferences of specific users.
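A minimal sketch of this joint training setup (PyTorch; the small image encoder, dimensions, and random batch are placeholder assumptions, not the DVBPR architecture itself) fits the score x_{u,i} = α + β_u + θ_u^T Φ(X_i) with the pairwise BPR objective from Table 7:

```python
# Sketch of BPR-style training with a jointly learned image representation
# (an illustrative stand-in for 'Deep' VBPR, Table 7, rows 1 and 3; the
# encoder and hyperparameters are placeholder assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepVBPR(nn.Module):
    def __init__(self, n_users, embed_dim=32):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))         # global offset
        self.beta_u = nn.Embedding(n_users, 1)            # user bias
        self.theta_u = nn.Embedding(n_users, embed_dim)   # visual preferences
        self.phi = nn.Sequential(                          # image encoder Phi(.)
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )

    def score(self, u, img):
        # x_{u,i} = alpha + beta_u + theta_u^T Phi(X_i)
        return (self.alpha + self.beta_u(u).squeeze(-1)
                + (self.theta_u(u) * self.phi(img)).sum(-1))

def bpr_loss(model, u, img_pos, img_neg):
    # maximize sigma(x_{u,i} - x_{u,i'}) over (user, preferred, non-preferred) triples
    diff = model.score(u, img_pos) - model.score(u, img_neg)
    return -F.logsigmoid(diff).mean()

model = DeepVBPR(n_users=1000)
u = torch.randint(0, 1000, (8,))
img_pos, img_neg = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
bpr_loss(model, u, img_pos, img_neg).backward()
```

Because Φ(·) sits inside the computation graph, gradients from the pairwise ranking loss shape the image representation itself, which is what distinguishes this setting from using pre-trained features.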

RQ3.2. How can recommender systems be used generatively for the purposes of content design, and how can users (or designers) effectively navigate such systems?


[Figure 9 annotations: real item from dataset; generated items from GAN; continuous manifold of plausible designs; personalized design for a specific user.]

Figure 9: We seek not just to recommend existing content, but to use our system to design new content that may be desirable to specific users or groups. For example, we can generate items with desirable or fashionable attributes. To do so we seek to estimate a manifold of plausible items (above) that can be explored in a personalized way.

Existing systems, including those to be developed in RQ3.1, are concerned with predicting which existing items and content a user will interact with (purchase, click, view, etc.). By building personalized models capable of synthesizing new content, we hope to develop tools that might be used by content creators or designers in order to generate, or to suggest attributes of, content that might be desirable to a user, population, or group.

An example of this type of contribution is shown in Figure 9. Essentially our goal is to develop methods that can efficiently sample arbitrary 'realistic' items from a prior distribution, and later to condition samples on the preferences or attributes desired by individual users. Generative Adversarial Networks (GANs) [98] offer an effective way to capture such a distribution and have demonstrated promising results in a variety of applications [99, 100]; in particular, conditional GANs [40] can generate images (or other content) according to semantic inputs like class labels [41], textual descriptions [57], etc.

Figure 10: Preliminary results. Examples of personalized GAN-generated items (Fig. 8f).

The overall ideas of such a system and potential architectures are shown in Figure 8. In a traditional GAN model (Fig. 8e and f), a discriminator D and a generator G are trained to optimize a value function V(G, D) [98]:

min_G max_D V(D, G) = E_{e∼p_data(e)}[log D(e)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].

A trivial way to use such a model to generate personalized items is to maximize an objective that balances a user's preference score θ_u^T Φ(e) against a realism penalty [D(e) - 1]^2, which is small when the discriminator judges the image to be 'real' (Table 7, row 5); we used this simple objective to generate the preliminary experiments in Figs. 9 and 10.
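A hypothetical sketch of this procedure (PyTorch; G, D, phi, and theta_u stand in for a pretrained generator, discriminator, image encoder, and fitted user preference vector, none of which are specified here) optimizes a latent code so that the decoded image scores highly for the user while remaining 'real' to the discriminator:

```python
# Sketch of personalized synthesis (Table 7, row 5) by optimizing a GAN latent
# code. G, D, phi, and theta_u are assumed to be pretrained/fitted elsewhere.
import torch

def personalized_synthesis(G, D, phi, theta_u, z_dim=100, eta=1.0, steps=200, lr=0.05):
    z = torch.randn(1, z_dim, requires_grad=True)       # latent code to optimize
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        e = G(z)                                         # candidate image
        pref = (theta_u * phi(e)).sum()                  # theta_u^T Phi(e)
        realism_penalty = (D(e) - 1.0).pow(2).sum()      # [D(e) - 1]^2
        loss = -(pref - eta * realism_penalty)           # maximize the objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G(z).detach()                                  # personalized design
```

This post-hoc latent search is only the simplest option; the conditional and jointly trained variants in Fig. 8f and g are the models RQ3.2 will develop and compare.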

These preliminary results are promising in the sense that they look 'plausible,' and have higher personalized objectives θ_u^T Φ(e) compared to existing items in the corpus. Following these results we seek to address the following problems: (a) evaluate, extend, and improve the results using deeper conditional networks modified from [63, 100]; (b) investigate joint training schemes that simultaneously optimize the recommendation and generation objectives (e.g. following the model in Fig. 8g); (c) conduct user studies to evaluate the quality of generated items. For the latter, much as we did in Thrust 2, we aim to show that the model is not merely useful as a means of synthesizing new content, but can also be used to better navigate and retrieve existing content in a quantitative sense [101].

Broader impacts: Artistic Recommendation and Personalized VQA

Figure 11: With XXXXXX we are working on systems for personalized Visual Question Answering.

This thrust is connected to work with two of our lab's collaborators. With , we have built models for artistic recommendation that incorporate visual and social features, using both public and proprietary data from the artistic community Behance [13], and more recently we have begun preliminary work developing models that incorporate both Siamese Networks and GANs (RQ3.1) [47].

Our lab will also collaborate with to build systems to answer personalized queries regarding the visual appearance of items. Even simple queries (e.g. finding 'good shoes to wear to a black-tie wedding') are challenging due to issues of ambiguity, subjectivity, and the dependence on visual appearance. Solving this problem is connected to Thrust 1, due to our need to answer subjective queries, and Thrust 3, due to our need to understand preferences within image models. This line of research extends recent developments on Visual Question Answering (VQA) [102] to allow for personalization and subjectivity (Fig. 11).

Our research in this thrust focuses largely on applications in art and fashion. While these applications are important in and of themselves [103–118], our focus on them is again largely due to the fact that these are domains for which the best data is available, i.e., large image corpora with associated preference judgments (reviews, likes, etc.). If successful, we plan to apply the same technology to other areas where visual factors influence people's decisions, as well as to non-visual domains where Generative Adversarial training can be used to design content (e.g. text [33, 119–121]).


task  |  example metrics  |  established in  |  comparison points/baselines  |  datasets (Table 9)
RQ1.1  |  Precision/Recall (hit rate), AUC  |  [12, 15, 66]  |  linear/bilinear models, translation-based embedding techniques  |  Extracted entity relations from Amazon Q/A and neurotological dialogs
RQ1.2  |  AUC, MAP, Precision@k  |  [12, 15, 161, 163]  |  relevance ranking techniques (e.g. Okapi BM25+)  |  (as RQ1.1)
RQ2.1  |  Perplexity, per-token accuracy, F-score  |  [20, 46]  |  char-LSTM, GCN, Fig. 7b-g  |  BeerAdvocate, Amazon, Yelp, EndoMondo, Strava, DDR, etc.
RQ2.2  |  RMSE, AUC, subjective preference  |  [45, 47]  |  BPR and MF-like methods, Fig. 7h  |  (as RQ2.1)
RQ3.1  |  RMSE, AUC  |  [9, 10]  |  VBPR, TVBPR [9, 10], Fig. 8b,c  |  Behance, Amazon
RQ3.2  |  Inception score, SSIM, diversity  |  [41, 62, 175]  |  GANs and conditional GANs (e.g. [40, 41, 174]), Fig. 8d-g  |  (as RQ3.1)

Table 8: Selection of metrics, baselines, and datasets to be used for evaluation.

5 Further Related Work

The overall themes of this proposal are grounded in traditional approaches to behavioral modeling, i.e., methods that establish latent user preferences and item properties from historical activity traces [45, 122–128]. A variety of methods extend traditional approaches by making use of content and context [50, 129–131], including temporally-aware [132–142] and socially-aware [2, 19, 76, 131, 143–146] models; such methods have also been extended to incorporate modern (i.e., 'deep') learning techniques [96, 147–149]. These techniques, which combine rich and heterogeneous forms of content to improve prediction, are also related to ideas from multimedia and information fusion [150–158]. Although we build on such technology (as well as our own work on content-aware behavioral models [1–25]), the main novelty of our work is the focus on 'content' in the output space, rather than as input to the model.

Our proposed work on subjective knowledge discovery and Q/A (Thrust 1) fits within a large body of work on Q/A and information retrieval. Our work is somewhat related to 'general purpose' techniques for Q/A, retrieval, and knowledge graph completion [30, 66, 67, 159–163], though these differ widely in formulation, and most are quite different from our own take on the topic [12, 15]. More closely related are methods that mine knowledge specifically from reviews, either for the purpose of summarization [86–90] or aspect-based opinion mining [91, 92]. A variety of works build Q/A systems upon opinion data [71, 159, 164–170], though most are limited to specific forms of query (e.g. mining compatibility relationships [72]). Part of this limitation of existing work owes to the lack of adequate benchmark data for general-purpose subjective Q/A, an issue our own work partly addressed through the introduction of a large subjective Q/A dataset [12]. Our work is also related to ideas of knowledge acquisition from crowds, especially methods that deal with subjective information [29].

Our work on personalized text generation (Thrust 2) again builds on ideas from aspect-aware recommendation and summarization [78–81, 171], as well as a growing body of work that uses text as a form of side information to improve the performance on traditional predictive tasks (e.g. rating prediction) [37, 172, 173], including methods based on CNNs and LSTMs [34–36]. We also build on standard models for text generation [33, 52, 55, 56, 84], and more specifically review generation [31, 32], though most existing models consider the generative task without personalizing results to individuals.

Finally, our work on personalized generative models (Thrust 3) was initially motivated by the growing body of work on fashion image modeling and recommendation [103–118]. Besides fashion, there is a growing interest in making recommender systems 'visually aware,' by exploiting content derived from images [51, 101], including our own work [6–10, 13, 14, 60]. Our work also builds upon image generation approaches based on GANs [62, 63] and related evaluation protocols [174]. Such models have been adapted to take conditional inputs (e.g. attributes, or textual descriptions) [40–43, 57]. In essence our contribution shall be a form of conditional GAN that allows for conditioning on individual users.

6 Evaluation Plan

6.1 Metrics and Baselines

All research in this proposal shall be measured in terms of baselines, metrics, and datasets established in prior work, as described in Table 8. In essence, our plan is to demonstrate that accounting for subjectivity/personalization yields improved performance over models that fail to do so.

Knowledge base completion and Q/A (Thrust 1): For knowledge extraction tasks (RQ1.1), our goal is to extract entity relationships that match labeled groundtruth; we plan to crowdsource a dataset of at least 10,000 labeled relationships. Groundtruth relations should be identified (recall), and identified relationships should be valid (precision). This follows the protocol used elsewhere for knowledge-base completion [66]. We will initially make use of a sample of reviews and answers to questions from our existing Amazon corpus; this corpus contains 1.4 million answered questions, meaning that we can use it as a testbed for an end-to-end evaluation of our relation extraction systems, in terms of their ability to resolve queries correctly. Later we will use knowledge extracted from neurotology dialogs following our collaboration with the UCSD School of Medicine. For Q/A tasks (RQ1.2), we shall evaluate methods in terms of the AUC as in our previous work [12, 15].
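For concreteness, the AUC used for the ranking evaluations throughout this proposal is the standard pairwise formulation; a minimal sketch (plain Python/NumPy, with illustrative score arrays rather than real model outputs) is:

```python
# Sketch of pairwise AUC for implicit feedback: for each user, the fraction of
# (held-out positive, sampled negative) pairs ranked correctly by the model's
# scores. The scores below are illustrative placeholders.
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    """pos_scores[u]: scores of user u's held-out items; neg_scores[u]: sampled negatives."""
    per_user = []
    for pos, neg in zip(pos_scores, neg_scores):
        pairs = len(pos) * len(neg)
        correct = sum(p > n for p in pos for n in neg)
        per_user.append(correct / pairs)
    return float(np.mean(per_user))

# Example: two users, each with model scores for held-out positives vs. negatives
print(pairwise_auc(pos_scores=[[2.1, 1.3], [0.7]],
                   neg_scores=[[0.2, 1.9], [0.5, 0.9, -0.1]]))
```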


Category  |  Type  |  Description  |  Source  |  No. of Records  |  Thrust  |  Refs.
e-Commerce  |  General reviews  |  Product reviews, metadata, images  |  Amazon  |  100M  |  T2, T3  |  [7–11, 14, 19]
e-Commerce  |  Reviews  |  Beer & wine reviews  |  Beeradvocate, RateBeer, Cellartracker  |  4.5M  |  T2  |  [24, 177]
e-Commerce  |  Reviews  |  Local business reviews  |  Google  |  11.4M  |  T2  |  [21]
e-Commerce  |  Activities and images  |  Clicks, likes, etc.  |  (Behance)  |  11.8M actions, 980k images  |  T3  |  [13]
Q/A  |  Q/A data  |  Product-related queries  |  Amazon  |  1.4M questions and answers  |  T1  |  [12, 16]
Q/A  |  Entity relations*  |  Extracted entity relationships  |  RQ1.1  |  TBD  |  T1  |
Q/A  |  Medical dialogs*  |  Medical Q/A and conversations  |  UCSD  |  TBD  |  T1  |
Sequences  |  Dance choreography  |  DDR step charts  |  StepMania  |  350k dance moves, 35 hours of audio  |  T2  |  [20]
Sequences  |  Workout logs  |  Heart-rate and GPS data (running and cycling)  |  EndoMondo, Strava  |  1M workouts  |  T2  |
Sequences  |  Metropolitan sensors  |  Road and building sensors  |  UCSD (NSF-IIS-1636879)  |  40M sensor readings  |  T2  |
Sequences  |  Clinical procedures  |  Procedure durations and characteristics  |  UCSD  |  117K  |  T2  |  [22]
Other structured RecSys data  |  Peer-to-peer trades  |  Bartering transactions  |  Ratebeer, Tradesy  |  125k peer-to-peer transactions  |  |  [16]
Other structured RecSys data  |  Grocery shopping  |  Grocery shopping transactions (baskets)  |  Anonymous Seattle-based grocery store (via collaboration w/ Microsoft)  |  152k transactions across 53k trips  |  |  [17]
Other structured RecSys data  |  Product bundles  |  Bundled purchases  |  Steam  |  902K  |  |  [18]

Table 9: Datasets to be used in this proposal, and associated thrusts and references. All are data either collected by the PI's lab, or obtained via an industrial or academic partnership. After publication, all (non-private) data are released publicly. (*to be collected)

Generative models of text (Thrust 2): Initial evaluation of generative text models (RQ2.1) shall be in terms of perplexity and per-token accuracy measurements (i.e., how 'surprised' the model is to see held-out data), as we have previously used for generative sequence modeling tasks [20]. Given that such measures do not adequately characterize the quality (or 'plausibility') of generated text, we will perform further qualitative evaluation via user studies; these will consist of pairwise preference comparisons between text generated via different approaches (see e.g. the protocol we used in [176]). We will also consider quantitative retrieval and recommendation measures (RQ2.2), i.e., demonstrating that a richer text model leads to improved MSE/AUC/Hit Rate when retrieving items (see e.g. [21]). For other sequence modeling tasks, like heartrate modeling, we are also interested in measures like the earth-mover's distance, to compare groundtruth and predicted heartrate sequences.

Generative image models and design (Thrust 3): As with Thrust 2 above, our initial quantitative evaluation (RQ3.1) shall be in terms of whether richer image models improve performance on recommendation and retrieval benchmarks. To evaluate the quality of generated images (RQ3.2), we shall make use of the inception score [62], as well as structural similarity (SSIM) [175] when measuring image diversity [41, 174]. Since measuring image quality is inherently subjective, we shall also evaluate the generated images in terms of how well they match user preference models, and through user studies (which will again consist of pairwise preference comparisons) [12, 47].
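As a concrete reference for the Thrust 2 measures, a minimal sketch of perplexity and per-token accuracy computed from a model's held-out token log-probabilities (plain Python/NumPy; the numbers are illustrative placeholders):

```python
# Sketch: perplexity and per-token accuracy on held-out text. The inputs are
# illustrative; in practice they come from the fitted sequence model.
import numpy as np

def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities the model assigned to each held-out token
    return float(np.exp(-np.mean(token_log_probs)))

def per_token_accuracy(predicted_tokens, target_tokens):
    return float(np.mean(np.array(predicted_tokens) == np.array(target_tokens)))

print(perplexity([-1.2, -0.4, -2.3]))                  # lower is better
print(per_token_accuracy([3, 7, 7, 1], [3, 7, 2, 1]))  # fraction of exact matches
```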

6.2 Datasets and Reproducibility

A selection of datasets to be used in this proposal (and their associated thrusts) is shown in Table 9. These data range from clicks, purchases, text, images, time-series, and social networks, to clinical procedures and dance choreography. Note that all are either collected by the PI's lab, or obtained via an industrial or academic partnership. Additional sources of public benchmark data (e.g. Yelp [178], MovieLens [179], Epinions [180], Dunnhumby [181], etc.) are also regularly used in our work.

Collecting data is an ongoing effort, with additional data to be collected throughout this project, including medical (Thrust 1) and commercial (Thrust 3) data that will contribute to the broader impacts of our research. The PI has a track record of releasing all datasets immediately after publication, along with open-source implementations of all research artifacts. Similarly, all (non-sensitive) data, methods, and code from this proposal will be released publicly. This guarantees that all methods and results produced here shall be reproducible.


1.1a Specify 'opinion-base' ontologies / relations
1.1b Collect labeled relations via crowdsourcing
1.1c Investigate supervised techniques for personalized ER extraction
1.1d Evaluate and compare against baselines for RQ1.1
1.2a Investigate extensions of MoE models from [12] using KB data
1.2b Investigate models for subjective question analysis/classification
1.2c Evaluate and compare against baselines for RQ1.2 *
Dissemination of RQ1 results
Broader Impacts: Systems for neurotological dialog and Q/A

2.1a Investigate fully-supervised GCN models (Fig. 7e and f)
2.1b Investigate semi-supervised GCN models (Fig. 7g) *
2.1c Evaluate and compare against baselines for RQ2.1
2.2a Investigate joint GCNs/recommender systems (Eq. (2))
2.2b Evaluate GCNs for personalized retrieval (incl. user studies)
Dissemination of RQ2 results *
Broader impacts: (user × sequence → sequence) models

3.1a Investigate methods for personalized comparative learning (Fig. 8c)
3.1b Evaluate DVBPR and VBPR against alternatives
3.2a Investigate GAN methods for personalized generation (Fig. 8e and f)
3.2b Investigate joint training schemes for RecSys+GAN training (Fig. 8g)
3.2c Evaluate RQ3.2 methods (incl. user studies) *
Dissemination of RQ3 results * *
Broader Impacts: Development of personalized VQA models

Table 10: Five-year timeline of tasks (Years 1–5) and collaborators (Table 11). (*Student is expected to graduate during this period.)

6.3 Advising and Collaboration Plan

Research in this proposal will be conducted by the Ph.D. students in the PI's lab (Table 11), as well as our academic and industrial collaborators (see letters of collaboration attached). Thrust 1 will be conducted by , building on her prior work on Opinion Q/A systems [15]. Thrust 2 will initially be conducted by ; following his graduation this research will be conducted by . Thrust 3 builds upon prior work by [9]; following his graduation work will be conducted by . developed the preliminary experiments reported in this proposal [46, 47, 85].

Ph.D. students (Student | Graduation):
  PhD, 2017
  PhD, 2018
  † PhD, 2020
  † PhD, 2021
  PhD, 2021
  PhD, 2021
  * PhD, 2019
  * PhD, 2019
  *† PhD, 2019
(*co-advised: CD with ; LM with ; IS with .)

Other student collaborators:
  Undergraduate, UCSD
  † Undergraduate, UCSD
  † Undergraduate, UCSD
  † Undergraduate, UCSD
  Undergraduate, UCSD
  Undergraduate/MS, UCSD
  MS, UCSD
  MS, UCSD
  MS, UCSD

Other academic and industrial collaborators:
  UCSD (School of Medicine)
  Neurotology Clinician
  Stanford
  UCSD
  UCSD

Table 11: PI's lab. (†Female / URM)

Broader impacts will be achieved through our industrial and academic partnerships. Our work on Q/A for patient intake will be conducted in collaboration with and ; this work is connected with an ongoing pilot study that will collect data and develop initial prototypes over the next 18 months; our use of UCSD's clinical record data is covered by IRB (April 2016), with additional approvals to be obtained as needed. Our work on sequence modeling for heartrate data will be conducted by . Finally, our work on artistic preferences and personalized VQA will be conducted with and .

Timeline and funding allocation. Our thrusts are staggered to coincide with the graduation of senior students and the training of new students ; we aim to complete Thrusts 1, 2, and 3 in years 3, 4, and 5. is funded by an MSR fellowship during years 1-2; will graduate in 2018 and will help to train ; and are new students partially funded by departmental scholarships during years 1-2. Thus we expect funding to be allocated between during year 3, and during years 1, 2, 4, and 5. An approximate timeline for these activities is shown in Table 10.

Although our timeline involves multiple tasks in parallel, our advising plan guarantees that students in our lab are encouraged to pursue interdisciplinary and collaborative research, exposing them to ideas from departments across campus, and putting their ideas into practice via industrial partnerships. Each of the collaborations described above is an established working relationship, involving regular meetings between the PI, collaborators, and advisees.


7 Educational Plan and Outreach Activities

Mentorship. In addition to my Ph.D. advising activities, I am actively involved in mentorship of students from diverse backgrounds, ranging from high-schoolers to engineering professionals. I am an advisor in UCSD's Early Research Scholars Program (ERSP), a mentorship program aimed at increasing retention of women and underrepresented minorities in CS by directly involving them in undergraduate research. I have advised nine undergraduates through this program across three research projects (SL, ZQ, SH during 2017). I am also an advisor in UCSD's MAS program, a graduate program for working engineering professionals. Several of my advisees are women and underrepresented minorities (Table 11).

Outreach. Recruitment and retention: I regularly engage in activities aimed at fostering interest in computer science amongst people of all ages: I have participated in events with the UC High Coding Club (high-schoolers), with the UC Village (transfer students), the UCSD CSE SPIS program (new admits), and as an advisor in the ERSP (aimed at increasing retention of URMs). Community workshops: I am actively involved in organizing community workshops on machine learning and data mining. In 2016/17 I organized the Workshop on Mining and Learning with Graphs (MLG); the Workshop on Big Graphs, Theory and Practice; the Workshop on Machine Learning meets Fashion; and the Southern California Machine Learning Symposium. The latter was a standalone event that attracted 43 papers, 7 corporate sponsors, and around 300 attendees. SDM Doctoral Forum: In 2018 I will serve as Chair of the SDM Doctoral Forum, an NSF-sponsored event that provides opportunities for Ph.D. students to discuss and receive feedback on their research from peers and senior practitioners in the field.

Integrating Research and Teaching. My teaching focuses on a 'hands-on' approach to machine learning, data mining, and recommender systems, by allowing students to work with large-scale datasets, solving predictive tasks that come directly from real-world applications. For example, my students compete on Kaggle to develop recommendation engines, using large review datasets from my own research (Table 9). Each week I cover case studies from the academic literature; this integration between fundamental concepts and cutting-edge research has proven to be among the most positively evaluated aspects of my classes, especially among undergraduates. In 2016/17 I developed new (graduate and undergraduate) classes on Web Mining and Recommender Systems, which have quickly grown to become the most popular electives (by enrollment) in our department; following the research in this proposal I plan to incorporate sections on generative and structured output modeling into my existing classes. In 2017/18 I will develop another course that focuses on large-scale implementation of modern recommender systems research.

Recommender Systems for Student Admissions. One of my roles in our department is as Chair of our MS recruiting committee. This is a demanding role, given the need to review over 3,000 applications (and around 10,000 letters of reference) per year. In the last year I have been analyzing our historical admissions with a goal to build recommender systems to (a) increase the efficiency of our admissions process; (b) identify the strongest files early, as well as 'outliers' that need special attention; and (c) reduce bias in our admissions process. For the latter, even simple analytics (e.g. what percentile is a 3.8 GPA at this university? How often does this letter writer say 'top 1%'? etc.) can drastically increase the efficiency of the process, while also increasing its fairness. I am also collaborating with Ndapa Nakashole (a new faculty member in our department) to analyze the use of gendered language in CS letters of recommendation, and the effect that it has on admissions decisions; this shall benefit from our results in Thrust 2, as this task inherently requires the analysis of subjective language.

Developing UCSD's Data Science Program. I am working to develop UCSD's new Data Science undergraduate major and minor programs, which are launching in 2017/18. My course will be incorporated into the new curriculum, and I am also working on the development of new introductory courses (DS10/20/30) to be offered in the program's lower division.

Results from Prior NSF Support

The PI is supported by NSF-IIS-1636879 ("Knowledge discovery and real-time interventions from sensory data flows in urban spaces," $517,917, with Altintas, I., Gadh, R., Gupta, R., Shutters, S., and Srivastava, M.). This project is currently in its first year of funding, and has so far led to two publications (with RH and RP) [10, 21], as well as the collection of large datasets of urban sensor measurements (Table 9). This project is concerned with building predictive models of urban sensors (e.g. building energy usage, weather, parking availability, EV charging, etc.), and as such connects to our broader impacts in Thrust 2. This project supports the collection of new heterogeneous sequence datasets, and conversely shall benefit from the new structured modeling techniques to be developed here.


8 References

[1] Jaewon Yang, Julian McAuley, and Jure Leskovec. Detecting cohesive and 2-mode communities in directed and undirected networks. In Web Search and Data Mining, 2014.
[2] T. Zhao, J. J. McAuley, and I. King. Leveraging social connections to improve personalized ranking for collaborative filtering. In Conference on Information and Knowledge Management, 2014.
[3] T. Zhao, J. J. McAuley, and I. King. Improving latent factor models via personalized feature projection for one-class recommendation. In Conference on Information and Knowledge Management, 2015.
[4] D. Lim, J. McAuley, and G. Lanckriet. Top-n recommendation with missing implicit feedback. In ACM Conference on Recommender Systems, 2015.
[5] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In Knowledge Discovery and Data Mining, 2015.
[6] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, 2015.
[7] Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, and Serge Belongie. Learning visual clothing style with heterogeneous dyadic co-occurrences. In ICCV, 2015.
[8] R. He, C. Lin, J. Wang, and J. McAuley. Sparse hierarchical embeddings for visually-aware one-class collaborative filtering. In International Joint Conference on Artificial Intelligence, 2016.
[9] R. He and J. McAuley. VBPR: visual bayesian personalized ranking from implicit feedback. In AAAI, 2016.
[10] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In World Wide Web, 2016.
[11] R. He and J. McAuley. Fusing similarity models with markov chains for sparse sequential recommendation. In International Conference on Data Mining, 2016.
[12] Julian McAuley and Alex Yang. Addressing complex and subjective product-related queries with customer reviews. In World Wide Web, 2016.
[13] R. He, C. Fang, Z. Wang, and J. McAuley. Vista: A visually, socially, and temporally-aware model for artistic recommendation. In RecSys, 2016.
[14] R. He, C. Packer, and J. McAuley. Learning compatibility across categories for heterogeneous item recommendation. In International Conference on Data Mining, 2016.
[15] M. Wan and J. McAuley. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In ICDM, 2016.
[16] J. Rappaz, M.-L. Vladarean, J. McAuley, and M. Catasta. Bartering books to beers: A recommender system for exchange platforms. In Web Search and Data Mining, 2017.
[17] M. Wan, D. Wang, M. Goldman, M. Taddy, J. Rao, J. Liu, D. Lymberopoulos, and J. McAuley. Modeling consumer preferences and price sensitivities from large-scale grocery shopping transaction logs. In World Wide Web, 2017.


[18] A. Pathak, K. Gupta, and J. McAuley. Generating and personalizing bundle recommendations on Steam. In SIGIR, 2017.
[19] C. Cai, R. He, and J. McAuley. SPMC: Socially-aware personalized markov chains for sparse sequential recommendation. In International Joint Conference on Artificial Intelligence, 2017.
[20] C. Donahue, Z. Lipton, and J. McAuley. Dance dance convolution. In International Conference on Machine Learning, 2017.
[21] R. He, W.-C. Kang, and J. McAuley. Translation-based recommendation. In RecSys, 2017.
[22] N. Ng, R. Gabriel, J. McAuley, C. Elkan, and Z. Lipton. Predicting surgery duration with neural heteroscedastic regression. In Machine Learning in Health Care, 2017.
[23] Julian McAuley and Jure Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In World Wide Web, 2013.
[24] Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In ACM Conference on Recommender Systems, 2013.
[25] Julian McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In Neural Information Processing Systems, 2012.
[26] Himabindu Lakkaraju, Julian McAuley, and Jure Leskovec. What's in a name? understanding the interplay between titles, content, and communities in social media. In International Conference on Weblogs and Social Media, 2013.
[27] Julian McAuley, Jure Leskovec, and Dan Jurafsky. Learning attitudes and attributes from multi-aspect reviews. In International Conference on Data Mining, 2012.
[28] Dawei Chen, Lexing Xie, Aditya Krishna Menon, and Cheng Soon Ong. Structured recommendation. arXiv preprint arXiv:1706.09067, 2017.
[29] Rui Meng, Hao Xin, Lei Chen, and Yangqiu Song. Subjective knowledge acquisition and enrichment powered by crowdsourcing. arXiv preprint arXiv:1705.05720, 2017.
[30] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. Building Watson: An overview of the DeepQA project. In AI Magazine, 2010.
[31] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. CoRR, abs/1704.01444, 2017.
[32] Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. Learning to generate product reviews from attributes. In Association for Computational Linguistics, 2017.
[33] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Controllable text generation. CoRR, abs/1703.00955, 2017.
[34] Lei Zheng, Vahid Noroozi, and Philip S. Yu. Joint deep modeling of users and items using reviews for recommendation. In Web Search and Data Mining, 2017.
[35] Rose Catherine and William W. Cohen. Transnets: Learning to transform for recommendation. CoRR, abs/1704.02298, 2017.
[36] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, and Alexander J. Smola. Joint training of ratings and reviews with recurrent recommender networks. In ICLR Workshops, 2017.


[37] Guang-Neng Hu. Integrating reviews into personalized ranking for cold start recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2017.
[38] Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.
[39] Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. Neural rating regression with abstractive tips generation for recommendation. In SIGIR, 2017.
[40] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[41] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
[42] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
[43] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In NIPS, 2016.
[44] Yehuda Koren and Robert Bell. Advances in collaborative filtering. In Recommender Systems Handbook. Springer, 2011.
[45] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: bayesian personalized ranking from implicit feedback. In UAI, 2009.
[46] Zachary Lipton, Sharad Vikram, and Julian McAuley. Generative concatenative nets jointly learn to write and classify reviews. In preparation.
[47] Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. Visually-aware fashion recommendation and design with generative image models. In submission.
[48] Steffen Rendle. Factorization machines. In ICDM, 2010.
[49] Ke Zhou, Shuang-Hong Yang, and Hongyuan Zha. Functional matrix factorizations for cold-start recommendation. In SIGIR, 2011.
[50] Trung V Nguyen, Alexandros Karatzoglou, and Linas Baltrunas. Gaussian process factorization machines for context-aware recommendations. In SIGIR, 2014.
[51] Chun-Che Wu, Tao Mei, Winston H Hsu, and Yong Rui. Learning to personalize trending image search suggestion. In SIGIR, 2014.
[52] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[53] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
[54] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[55] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016.


[56] Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. Affect-lm: A neural language model for customizable affective text generation. CoRR, abs/1704.06851, 2017.
[57] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[58] Jaewon Yang, Julian McAuley, Jure Leskovec, Paea LePendu, and Nigam Shah. Finding progression stages in time-evolving event sequences. In World Wide Web, 2014.
[59] Jaewon Yang, Julian McAuley, and Jure Leskovec. Community detection in networks with node attributes. In International Conference on Data Mining, 2013.
[60] J. J. McAuley and J. Leskovec. Image labeling on a network: using social-network metadata for image classification. In European Conference on Computer Vision, 2012.
[61] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[62] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.
[63] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.
[64] Ryen White, Sheng Wang, Apurv Pant, Rave Harpaz, Pushpraj Shukla, Walter Sun, William DuMouchel, and Eric Horvitz. Early identification of adverse drug reactions from search log data. Journal of Biomedical Informatics, 2016.
[65] Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Re. Incremental knowledge base construction using deepdive. In VLDB, 2015.
[66] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, 2015.
[67] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In KDD, 2008.
[68] Vikas Raykar, Shipeng Yu, Linda Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. JMLR, 2010.
[69] Aditya Menon and Charles Elkan. Link prediction via matrix factorization. Machine Learning and Knowledge Discovery in Databases, pages 437–452, 2011.
[70] Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. Knowledge-based question answering as machine translation. Cell, 2(6), 2014.
[71] Bushra Anjum and Chaman L Sabharwal. An entropy based product ranking algorithm using reviews and q&a data. In DMS, pages 69–76, 2016.
[72] Hu Xu, Lei Shu, Jingyuan Zhang, and Philip S Yu. Mining compatible/incompatible entities from question and answering via yes/no answer classification using distant label expansion. arXiv preprint arXiv:1612.04499, 2016.
[73] Yang Bao, Hui Fang, and Jie Zhang. TopicMF: Simultaneously exploiting ratings and reviews for recommendation. In AAAI, 2014.


[74] Zhongqi Lu, Zhicheng Dou, Jianxun Lian, Xing Xie, and Qiang Yang. Content-based collaborative filtering for news topic recommendation. In AAAI, 2015.
[75] Bhargav Kanagal, Amr Ahmed, Sandeep Pandey, Vanja Josifovski, Jeff Yuan, and Lluis Garcia-Pueyo. Supercharging recommender systems using taxonomies for learning user purchase behavior. In VLDB, 2012.
[76] Zhi Qiao, Peng Zhang, Yanan Cao, Chuan Zhou, Li Guo, and Binxing Fang. Combining heterogenous social and geographical information for event recommendation. In AAAI, 2014.
[77] J. McAuley, J. Leskovec, and D. Jurafsky. Learning attitudes and attributes from multi-aspect reviews. In ICDM, 2012.
[78] S. Blair-Goldensohn, T. Neylon, K. Hannan, G. Reis, R. McDonald, and J. Reynar. Building a sentiment summarizer for local service reviews. In NLP in the Information Explosion Era, 2008.
[79] S. Brody and N. Elhadad. An unsupervised aspect-sentiment model for online reviews. In ACL, 2010.
[80] G. Ganu, N. Elhadad, and A. Marian. Beyond the stars: Improving rating predictions using review text content. In WebDB, 2009.
[81] N. Gupta, G. Di Fabbrizio, and P. Haffner. Capturing the stars: predicting ratings for service and product reviews. In HLT Workshops, 2010.
[82] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[83] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
[84] Alex Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
[85] Jianmo Ni, Zachary Lipton, Sharad Vikram, and Julian McAuley. Estimating reactions and recommending products with generative models of reviews. In submission.
[86] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In SIGKDD, 2004.
[87] Giuseppe Carenini, Jackie Chi Kit Cheung, and Adam Pauls. Multi-document summarization of evaluative text. Computational Intelligence, 2013.
[88] Rahul Katragadda and Vasudeva Varma. Query-focused summaries or query-biased summaries? In ACL-IJCNLP, 2009.
[89] Xinfan Meng and Houfeng Wang. Mining user reviews: from specification to summarization. In ACL-IJCNLP, 2009.
[90] Jian Liu, Gengfeng Wu, and Jianxin Yao. Opinion searching in multi-product reviews. In CIT, 2006.
[91] Samaneh Moghaddam and Martin Ester. AQA: aspect-based opinion question answering. In ICDMW, 2011.
[92] Hong Yu and Vasileios Hatzivassiloglou. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In EMNLP, 2003.
[93] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Discriminative deep metric learning for face verification in the wild. In CVPR, 2014.


[94] Sean Bell and Kavita Bala. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG), 34(4):98, 2015.
[95] Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. Comparative deep learning of hybrid representations for image recommendations. In CVPR, 2016.
[96] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In SIGKDD, 2015.
[97] Ruining He and Julian McAuley. VBPR: Visual bayesian personalized ranking from implicit feedback. In AAAI, 2016.
[98] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[99] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
[100] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[101] Devashish Shankar, Sujay Narumanchi, HA Ananya, Pramod Kompalli, and Krishnendu Chaudhury. Deep learning based large scale visual recommendation and search for e-commerce. arXiv preprint arXiv:1703.02344, 2017.
[102] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.
[103] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What makes paris look like paris? SIGGRAPH, 2012.
[104] Diane J. Hu, Rob Hall, and Josh Attenberg. Style in the long tail: Discovering unique interests with latent variable models in large scale social e-commerce. In KDD, 2014.
[105] Wei Di, Catherine Wah, Anurag Bhardwaj, Robinson Piramuthu, and Neel Sundaresan. Style finder: Fine-grained clothing style detection and retrieval. In CVPR Workshops, 2013.
[106] Vignesh Jagadeesh, Robinson Piramuthu, Anurag Bhardwaj, Wei Di, and Neel Sundaresan. Large scale visual recommendations from street fashion images. arXiv:1401.1778, 2014.
[107] Yannis Kalantidis, Lyndon Kennedy, and Li-Jia Li. Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. In ICMR, 2013.
[108] Si Liu, Jiashi Feng, Zheng Song, Tianzhu Zhang, Hanqing Lu, Changsheng Xu, and Shuicheng Yan. Hi, magic closet, tell me what to wear! In ACM Conference on Multimedia, 2012.
[109] Edgar Simo-Serra and Hiroshi Ishikawa. Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In CVPR, 2016.
[110] Kota Yamaguchi, M Hadi Kiapour, and Tamara L Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.
[111] Wei Yang, Ping Luo, and Liang Lin. Clothing co-parsing by joint image segmentation and labeling. In CVPR, 2014.
[112] Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, and Raquel Urtasun. Neuroaesthetics in fashion: Modeling the perception of fashionability. In CVPR, 2015.


[113] Iljung Sam Kwak, Ana C. Murillo, Peter Belhumeur, Serge Belongie, and David Kriegman. From bikers to surfers: Visual recognition of urban tribes. In BMVC, 2013.
[114] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, 2016.
[115] Ana C Murillo, Iljung S Kwak, Lubomir Bourdev, David Kriegman, and Serge Belongie. Urban tribes: Analyzing group photos from a social perspective. In CVPR Workshops, 2012.
[116] Sirion Vittayakorn, Kota Yamaguchi, Alexander C Berg, and Tamara L Berg. Runway to realway: Visual analysis of fashion. In WACV, 2015.
[117] Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. Apparel classification with style. In ACCV, 2013.
[118] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the materials in context database. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3479–3487, 2015.
[119] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
[120] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
[121] Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887, 2017.
[122] Steffen Rendle. Factorization machines. In ICDM, 2010.
[123] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, 2016.
[124] Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Mymedialite: a free recommender system library. In RecSys, 2011.
[125] Robert M. Bell, Yehuda Koren, and Chris Volinsky. The bellkor solution to the netflix prize, 2007.
[126] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, 2008.
[127] James Bennett and Stan Lanning. The netflix prize. In KDDCup, 2007.
[128] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul Kantor, editors. Recommender Systems Handbook. Springer, 2011.
[129] Defu Lian, Yong Ge, Fuzheng Zhang, Nicholas Jing Yuan, Xing Xie, Tao Zhou, and Yong Rui. Content-aware collaborative filtering for location recommendation based on human mobility data. In ICDM, 2015.
[130] Maciej Kula. Metadata embeddings for user and item cold-start recommendations. arXiv preprint arXiv:1507.08439, 2015.
[131] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King. Sorec: social recommendation using probabilistic matrix factorization. In CIKM, 2008.
[132] Zijun Yao, Yanjie Fu, Bin Liu, Yanchi Liu, and Hui Xiong. POI recommendation: A temporal matching between POI popularity and user regularity. In ICDM, 2016.


[133] Subhabrata Mukherjee, Hemank Lamba, and Gerhard Weikum. Experience-aware item recommendation in evolving review communities. In ICDM, 2015.
[134] Jingyuan Yang, Chuanren Liu, Mingfei Teng, Hui Xiong, March Liao, and Vivian Zhu. Exploiting temporal and social factors for B2B marketing campaign recommendations. In ICDM, 2015.
[135] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Factorizing personalized markov chains for next-basket recommendation. In WWW, 2010.
[136] Andrew Zimdars, David Maxwell Chickering, and Christopher Meek. Using temporal data for making recommendations. In UAI, 2001.
[137] Guy Shani, Ronen I Brafman, and David Heckerman. An mdp-based recommender system. In UAI, 2002.
[138] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa. Using sequential and non-sequential patterns in predictive web usage mining tasks. In ICDM, 2002.
[139] Haixun Wang, Wei Fan, Philip S Yu, and Jiawei Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003.
[140] Ralf Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 2004.
[141] Yi Ding and Xue Li. Time weight collaborative filtering. In CIKM, 2005.
[142] Yehuda Koren. Collaborative filtering with temporal dynamics. Communications of the ACM, 2010.
[143] Artus Krohn-Grimberghe, Lucas Drumond, Christoph Freudenthaler, and Lars Schmidt-Thieme. Multi-relational matrix factorization using bayesian personalized ranking for social network data. In WSDM, 2012.
[144] Weike Pan and Li Chen. Gbpr: Group preference based bayesian personalized ranking for one-class collaborative filtering. In IJCAI, 2013.
[145] Guibing Guo, Jie Zhang, and Neil Yorke-Smith. Trustsvd: Collaborative filtering with both the explicit and implicit influence of user trust and of item ratings. In AAAI, 2015.
[146] Allison JB Chaney, David M Blei, and Tina Eliassi-Rad. A probabilistic model for using social networks in personalized item recommendation. In RecSys, 2015.
[147] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. In Workshop on Deep Learning for Recommender Systems, 2016.
[148] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In NIPS, 2013.
[149] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys, 2016.
[150] Milind Naphade, John R Smith, Jelena Tesic, Shih-Fu Chang, Winston Hsu, Lyndon Kennedy, Alexander Hauptmann, and Jon Curtis. Large-scale concept ontology for multimedia. IEEE Multimedia, 2006.


[151] Tao Chen, Felix Yu, Jiawei Chen, Yin Cui, Yan-Ying Chen, and Shih-Fu Chang. Object-based visual sentiment concept analysis and application. In ACM Conference on Multimedia, 2014.
[152] Nikhil Rasiwasia, Pedro Mereno, and Nuno Vasconcelos. Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 2007.
[153] John Smith and Shih-Fu Chang. VisualSEEk: a fully automated content-based image query system. In ACM Conference on Multimedia, 1996.
[154] Michael S Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM TOMCCAP, 2006.
[155] Dong Liu, Guangnan Ye, Ching-Ting Chen, Shuicheng Yan, and Shih-Fu Chang. Hybrid social media network. In ACM Conference on Multimedia, 2012.
[156] Malcolm Slaney. Web-scale multimedia analysis: Does content matter? http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5754643, 2011.
[157] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert Lanckriet, Roger Levy, and Nuno Vasconcelos. A new approach to cross-modal multimedia retrieval. In ACM Conference on Multimedia, 2010.
[158] Cees Snoek and Marcel Worring. Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications, 2005.
[159] Jing He and Decheng Dai. Summarization of yes/no questions using a feature function model. JMLR, 2011.
[160] A. Bordes, J. Weston, R. Collobert, and Y. Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
[161] Yuanhua Lv and ChengXiang Zhai. Lower-bounding term frequency normalization. In CIKM, 2011.
[162] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In ACL Workshop on Text Summarization Branches Out, 2004.
[163] K Jones, Steve Walker, and Stephen Robertson. A probabilistic model of information retrieval: development and comparative experiments. Information Processing & Management, 2000.
[164] Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In EMNLP-CoNLL, 2012.
[165] Alexandra Balahur, Ester Boldrini, Andres Montoyo, and Patricio Martínez-Barco. Going beyond traditional QA systems: challenges and keys in opinion question answering. In COLING, 2010.
[166] Alexandra Balahur, Ester Boldrini, Andres Montoyo, and Patricio Martínez-Barco. Opinion question answering: Towards a unified approach. In ECAI, 2010.
[167] Alexandra Balahur, Ester Boldrini, Andres Montoyo, and Patricio Martínez-Barco. Opinion and generic question answering systems: a performance analysis. In ACL-IJCNLP, 2009.
[168] Fangtao Li, Yang Tang, Minlie Huang, and Xiaoyan Zhu. Answering opinion questions with random walks on graphs. In ACL-IJCNLP, 2009.
[169] Veselin Stoyanov, Claire Cardie, and Janyce Wiebe. Multi-perspective question answering using the OpQA corpus. In HLT/EMNLP, 2005.


[170] Lun-Wei Ku, Yu-Ting Liang, and Hsin-Hsi Chen. Question analysis and answer passage retrieval for opinion question answering systems. In ROCLING, 2007.
[171] I. Titov and R. McDonald. A joint model of text and aspect ratings for sentiment summarization. In ACL, 2008.
[172] Guang Ling, Michael R Lyu, and Irwin King. Ratings meet reviews, a combined approach to recommend. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 105–112. ACM, 2014.
[173] Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J Smola, Jing Jiang, and Chong Wang. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 193–202. ACM, 2014.
[174] Lucas Theis, Aaron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In ICLR, 2016.
[175] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
[176] Chris Donahue, Akshay Balsubramani, Julian McAuley, and Zachary C Lipton. Semantically decomposing the latent spaces of generative adversarial networks. arXiv preprint arXiv:1705.07904, 2017.
[177] J. J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In ACM Conference on Recommender Systems, 2013.
[178] Yelp. Yelp Dataset Challenge. https://www.yelp.com/dataset_challenge, 2013.
[179] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
[180] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. 2015.
[181] Dunnhumby. Dunnhumby Dataset. https://www.dunnhumby.com/sourcefiles.

9 Data Management Plan

Expected Data to be Managed

Data to be used in this proposal are listed below (see Table 9 in the Narrative). Data span records including clicks, purchases, co-purchases, trades, likes, reviews, images, questions-and-answers, heart-rate logs, metropolitan sensors, clinical records, and dance choreography:

Category | Type | Description | Source | No. of Records | Students & Collabs. | Thrust | Refs.
e-Commerce | General reviews | Product reviews, metadata, images | Amazon | 100M | MW, WK | T2, T3 | [7–11, 14, 19]
e-Commerce | Reviews | Beer & wine reviews | Beeradvocate, RateBeer, Cellartracker | 4.5M | ZL, JN | T2 | [24, 177]
e-Commerce | Reviews | Local business reviews | Google | 11.4M | RH, WK | T2 | [21]
e-Commerce | Activities and images | Clicks, likes, etc. | Adobe (Behance) | 11.8M actions, 980k images | CF, WK | T3 | [13]
Q/A | Q/A data | Product-related queries | Amazon | 1.4M questions and answers | MW | T1 | [12, 16]
Q/A | Entity relations* | Extracted entity relationships | RQ1.1 | TBD | MW | T1 |
Q/A | Medical dialogs* | Medical Q/A and conversations | UCSD | TBD | EV, CH | T1 |
Sequences | Dance choreography | DDR step charts | StepMania | 350k dance moves, 35 hours of audio | CD, ZL | T2 | [20]
Sequences | Workout logs | Heart-rate and GPS data (running and cycling) | EndoMondo, Strava | 1M workouts | MM, LM | T2 |
Sequences | Metropolitan sensors | Road and building sensors | UCSD (NSF-IIS-1636879) | 40M sensor readings | RG, RP | T2 |
Sequences | Clinical procedures | Procedure durations and characteristics | UCSD | 117K | ZL, NN | T2 | [22]
Other structured RecSys data | Peer-to-peer trades | Bartering transactions | Ratebeer, Tradesy | 125k peer-to-peer transactions | MC | | [16]
Other structured RecSys data | Grocery shopping | Grocery shopping transactions (baskets) | Anonymous Seattle-based grocery store (via collaboration w/ Microsoft) | 152k transactions across 53k trips | MW, DW | | [17]
Other structured RecSys data | Product bundles | Bundled purchases | Steam | 902K | AP, KG | | [18]
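To make the structure of these records concrete, the following minimal sketch shows how a single review record might be parsed from one of the released datasets. It assumes line-delimited JSON in the layout of our publicly released Amazon review data; the field names (reviewerID, asin, overall, reviewText) are illustrative of that layout only, and the other datasets in the table follow their own schemas.

import json

# A single (hypothetical) line from a line-delimited JSON review file,
# following the layout of the publicly released Amazon review data.
# Field names here are illustrative, not a specification.
example_line = (
    '{"reviewerID": "A1", "asin": "B000XYZ", '
    '"overall": 4.0, "reviewText": "Works as described."}'
)

def parse_review(line):
    """Parse one line-delimited JSON record into (user, item, rating, text)."""
    record = json.loads(line)
    return (
        record["reviewerID"],          # pseudonymous user identifier
        record["asin"],                # item identifier
        record["overall"],             # star rating
        record.get("reviewText", ""),  # free-text review body
    )

if __name__ == "__main__":
    print(parse_review(example_line))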

The majority of data required for initial evaluation has already been collected. Our lab has a track record of providing reference implementations and complete data from all academic publications to date;1 similarly, wherever possible, data collected in this proposal will be made available. We note the following exceptions:

1. Proprietary datasets, i.e., those collected from industrial collaborations, will not be released, and shall be maintained securely. Where possible, we will release (anonymized) public subsets of the data, as we have done previously with our Behance data during our collaboration with Adobe [13]; a sketch of this anonymization step appears after this list. Doing so ensures that our results are verifiable and extensible, even where private data are used.

2. Sensitive data (e.g., clinical notes) will not be released, and shall be maintained securely. Our use of UCSD's clinical records data is covered under IRB #160410, with additional approvals to be obtained as needed.

3. Collection of additional data for neurotology patient intake and triage (Thrust 1) will be done in collaboration with Chun-Nan Hsu (Professor of Medicine, UCSD), and Erik Viirre (a practicing physician). Data collection for this project is expected to be completed by Q2 2019.
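As a concrete illustration of the anonymization mentioned in exception 1, the sketch below shows one simple way a public subset could be prepared: raw user identifiers are replaced with salted one-way hashes, and only a whitelisted set of fields is retained. This is a hypothetical sketch of the general approach rather than the exact procedure used for any particular dataset; the field names, the salt handling, and the anonymize_record helper are illustrative.

import hashlib
import json

def pseudonymize(user_id, salt):
    """Replace a raw user identifier with an irreversible salted hash."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]

def anonymize_record(record, salt, keep_fields=("item_id", "rating", "timestamp")):
    """Keep only whitelisted fields plus a pseudonymous user key."""
    out = {field: record[field] for field in keep_fields if field in record}
    out["user"] = pseudonymize(record["user_id"], salt)
    return out

if __name__ == "__main__":
    # The salt is kept private, so raw identifiers cannot be recovered
    # by re-hashing known user IDs.
    salt = "private-salt"
    raw = {"user_id": "alice@example.com", "item_id": "I42",
           "rating": 5, "timestamp": 1500000000}
    print(json.dumps(anonymize_record(raw, salt)))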

Collecting additional datasets is a substantial and ongoing effort by our lab, and necessary for the advancement of research on behavioral models and recommender systems. Many of the datasets collected by our lab have become de facto standards in industry and academia, and are widely used in data mining and machine learning classes worldwide.

Dissemination

As per the NSF policy on the dissemination and sharing of research results, all research data shall be made available at the time of publication, and shall remain available for a minimum of three years.

1 http://cseweb.ucsd.edu/~jmcauley/
