
PlateClick: Bootstrapping Food Preferences Through an Adaptive Visual Interface

Longqi Yang†,‡, Yin Cui†,‡, Fan Zhang†,‡, John P. Pollak‡, Serge Belongie†,‡, Deborah Estrin†,‡
†Department of Computer Science, Cornell University; ‡Cornell Tech

{ylongqi, ycui, fanz}@cs.cornell.edu; {jpp9, sjb344, de226}@cornell.edu

ABSTRACT

Food preference learning is an important component of wellness applications and restaurant recommender systems, as it provides personalized information for effective food targeting and suggestions. However, existing systems require some form of food journaling to create a historical record of an individual's meal selections. In addition, current interfaces for food or restaurant preference elicitation rely extensively on text-based descriptions and rating methods, which can impose a high cognitive load, thereby hampering wide adoption.

In this paper, we propose PlateClick, a novel system that bootstraps food preference using a simple, visual quiz-based user interface. We leverage a pairwise comparison approach with only visual content. Using 10,028 recipes collected from Yummly, we design a deep convolutional neural network (CNN) to learn the similarity distance metric between food images. Our model is shown to outperform a state-of-the-art CNN by a factor of 4 in terms of mean Average Precision. We explore a novel online learning framework that is suitable for learning users' preferences across a large-scale dataset based on a small number of interactions (≤ 15). Our online learning approach balances exploitation and exploration, and takes advantage of food similarities using preference propagation in locally connected graphs.

We evaluated our system in a field study of 227 anonymous users. The results demonstrate that our method outperforms other baselines by a significant margin, and the learning process can be completed in less than one minute. In summary, PlateClick provides a light-weight, immersive user experience for efficient food preference elicitation.

Categories and Subject Descriptors

H.4.m [Information Systems Applications]: Miscellaneous; I.2.6 [Artificial Intelligence]: Learning

Keywords

Food preference elicitation; visual interface; online learning


1. INTRODUCTION

The problem of capturing and understanding people's food preferences has attracted substantial attention from industry (e.g., Yelp and Foursquare) and academia [12, 13, 17, 38]. Food preferences guide our diet choices [32], which in turn have a strong effect on our personal health and social lives [32]. A recipe advice system [13, 27] could more effectively coach users to prepare healthier meals at home if the alternative food suggestions it provides are appealing to them. This is important because healthy diet recommendations are of no benefit if users don't adopt them. Another application area that could leverage user preferences is commercial restaurant recommender systems like Yelp and Foursquare. Such recommendations would be more accurate and personalized if the system output were tuned to the user's diet profile. However, food preference is notoriously difficult to learn because of its dependence on context (e.g., evolving personal goals). Existing systems and algorithms suffer from several limitations that interfere with efficient learning and wide user adoption:

Data sparsity. Current food preference learning systems require longitudinal records of the meals that users have eaten [12, 13, 17, 38]. These historical data traces [11] typically come from location sensing, which is not always related to food preference, or from burdensome food journaling, which is often abandoned after short periods of adoption [9]. As a result, data points are too sparse to provide enough food preference information.

High cognitive load. Preference elicitation [29], in which users are asked to rate different food items explicitly (on a scale from 1 to 5) [12, 17], is an approach complementary to the longitudinal methods above. While text-based instruments and rating methods [5, 10, 35, 43] effectively address the cold-start problem [23, 34, 42] in movie recommender systems, the counterpart for food preference is especially hard for users [9], as it is time consuming [28] and presents a high cognitive load [8].

Insufficient Visual Understanding. Traditional methods mainly use food tags and other metadata for learning tasks [12, 17]. However, as people's diet decisions are greatly influenced by the visual appearance of meals [9], analysis of image features can provide a valuable signal for diet profiling and food preference elicitation.

In this paper, we propose PlateClick, a novel system for efficient food preference elicitation using a simple, visual quiz-based user interface. PlateClick alleviates the limitations mentioned above with deep understanding of food images and a user-friendly visual interface. To


Figure 1: PlateClick system pipeline. The system is divided into two major parts: offline preprocessing, surrounded by the green dotted line, and online preference learning, surrounded by the blue dotted line.

the best of our knowledge, this is the first system and algorithm that learns users' food preferences through real-time interactions without any requirement for diet history. We developed this system as a lightweight, easily accessible web service whose flow can be completed within 60 seconds. Through a field study with 227 anonymous users in the wild, we show that our system is able to predict the food items that a user likes/dislikes with high accuracy. The system pipeline of PlateClick is shown in Fig. 1 and consists of offline and online stages with several components, as follows.

Food Items Harvesting. We pulled 12,000 main-dish recipes with their images and metadata (ingredients, nutrients, tastes, etc.) via the Yummly API¹ and filtered out image outliers. The final dataset has 10,028 food items across various cuisines (American, Asian, Mexican, Italian, etc.).

Food Similarity Embedding. In order to tame the large number of food items and facilitate image understanding, we learn an image similarity distance metric based on labels from the Food-101 dataset [4] using a deep Siamese network [7]. With the trained network, we extract 1,000 dim visual features for each food image. The method we propose improves on other state-of-the-art visual feature extraction approaches. We append 200 dim ingredient features to the visual features, resulting in an embedding of the food items into a 1,200 dim space in which similar items are near one another and dissimilar ones are farther away.

Visual User Interface. Our food preference elicitation process consists of several iterations. We explore the advantages of image picking and pairwise comparison in interface design, both of which offer a potentially improved user experience [8, 28]. In each of the first two iterations, we present ten images and users are asked to tap on all the ones they like. In each subsequent iteration, we present a pair of food images and ask users either to tap on whichever they prefer or to click "yuck," indicating a preference for neither.

¹http://developer.yummly.com

Backend Online Learning. The backend of our system consists of a novel online learning algorithm that exploits the similarity between food items. Our algorithm is inspired by label propagation [44] in locally connected graphs and the Exponentiated Gradient algorithm for bandit settings (EXP3) [2]. We demonstrate that this algorithm is more effective in our proposed workflow than other baselines.

Compared to traditional food preference learning systems, our work offers three major contributions and points of novelty.

• We remove the requirement for users' historical diet records and bootstrap food preference entirely from scratch. This design enables context-aware preference learning that adapts preference information to varying conditions.

• We propose a novel image similarity measure that significantly outperforms state-of-the-art algorithms. By leveraging an improved embedding of food items, we simplify the UI by making it completely image based. This offers opportunities for personal interpretation [28] and can provide an immersive experience with personalized, adaptive information [9].

• We design a novel online learning algorithm that can support real-time elicitation of food preference from a modest number (≤ 15) of pairwise comparisons over a large-scale food image database (> 10,000 instances). The pairwise comparison method is known to have lower cognitive load [15], making our system more user-friendly.

We envision that PlateClick, a light-weight and efficient food preference elicitation system, will provide a personalized user experience capable of fueling a wide range of applications in domains including health care, diet planning, and restaurant recommendation.

2. RELATED WORK

Collaborative Filtering. As one of the most popular algorithms adopted in current recommender systems, collaborative filtering (CF) [18] has been widely studied in a variety of applications. The main idea of this method is to predict and learn a user's preferences based on similarity measures, as in user-based CF [18] and item-based CF [33]. It has also been shown that latent factor models [21] and matrix factorization [31] are promising for predicting users' ratings for previously unobserved items.

A major limitation of collaborative filtering is its requirement for historical user data. Although several techniques have been proposed to address the cold-start problem [23, 42], the performance of CF still largely depends on the number of active users, the availability of contextual information [42], and the observed ratings for different items [23]. In the case of food preference learning, it is typically difficult to get access to a user's diet history, since meal journaling is burdensome [9]. Therefore, in the design of PlateClick, we don't assume any prior knowledge of the users.

Preference elicitation. To alleviate the cold-start problem mentioned above, several models of preference elicitation have been proposed in recent years. The most prevalent method of elicitation is to train decision trees that poll users in a structured fashion [10, 14, 30, 35, 45]. These questions are either generated in advance and remain static [30] or change dynamically based on real-time user feedback [10, 14, 35, 45]. In addition, other prior work explores the possibility of eliciting item ratings directly from the user [5, 43]. This process can be carried out either at the item level [43] or within a category (e.g., movies) [5].

The preference elicitation methods mentioned above

largely focus on the domain of movie recommendations [5, 30, 35, 43] and visual commerce [10] (e.g., cars, cameras), where items can be categorized based on readily available metadata. When it comes to real dishes, however, categorical data (e.g., cuisines) and other associated information (e.g., cooking time) possess a much weaker connection to a user's food preferences. Through the design of PlateClick, we leverage the visual representation of each meal so as to better capture the process through which people make diet decisions.

Food preference learning systems. Most existing food

preference learning approaches are hybrids of historical record mining and rating elicitation [12, 13, 17, 38, 40]. To avail oneself of these systems to promote healthy eating, one is often required to record daily meal intake and provide this information as a bootstrapping resource to a diet recommender system [12, 13, 38]. After that, several food items are selected for display based on matching scores between meal metadata and the user's previous choices. For each item provided, the user is prompted to enter a rating on a scale from 1 to 5.

To the best of our knowledge, no existing systems take visual features – arguably one of the most important factors in assisting people's daily food choices [9] – into consideration. Additionally, the most common methods adopted in food preference elicitation (i.e., text-based interfaces and numerical rating scales) impose a high cognitive load on the user [8] and are susceptible to noisy and unreliable responses. Inspired by elicitation strategies in other domains (e.g., crowdsourcing [6], housing [15]), we propose a simplified, purely visual interface that presents users with simple pairwise comparisons. Through our field study with anonymous users, we show that this lightweight interface can promote efficient food preference learning.

3. METADATA PREPROCESSING

From the 12,000 main-dish recipes pulled via the Yummly API, we filter out entries with unrelated (or missing) image content, resulting in a final dataset S containing 10,028 food items, i.e., S = {s_1, s_2, ..., s_{10028}}. As illustrated in Fig. 1, apart from visual features, we append 200 dim ingredient features to the representation of each food image. The ingredient feature vector is calculated according to the following pre-processing steps:

1. Keyword extraction and lemmatization. For each ingredient appearing in the metadata, we extract keywords from its description and apply lemmatization; see Table 1.

2. Aggregation and Filtering. We aggregate and count the occurrences of each ingredient appearing in our dataset. We select the top 200 most frequent ingredients² as our list of ingredient features.

3. Feature Vector Calculation. For each food item s_i ∈ S, its d_ingr = 200 dim normalized ingredient feature vector g^{s_i} = [g^{s_i}_1, ..., g^{s_i}_{d_ingr}] is calculated based on whether ingredient j appears in food item s_i's ingredient list Ingr{s_i}, as Equation (1) shows:

g^{s_i}_j = \begin{cases} 1/|\mathrm{Ingr}\{s_i\}| & : j \in \mathrm{Ingr}\{s_i\} \\ 0 & : j \notin \mathrm{Ingr}\{s_i\} \end{cases}    (1)

²We will incorporate more sophisticated methods such as tf-idf and homonym/synonym handling in the future.

original ingredient        filtered ingredient
low moisture mozzarella    mozzarella
fresh mozzarella           mozzarella
less sodium beef broth     beef broth
lower sodium beef broth    beef broth
chicken eggs               egg
soft-boiled egg            egg

Table 1: Keyword extraction and lemmatization.
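To make Eqn. (1) concrete, here is a minimal NumPy sketch of the ingredient feature computation; the toy vocabulary and the helper name `ingredient_feature` are our illustration, not part of the paper's implementation.

```python
import numpy as np

def ingredient_feature(item_ingredients, vocab):
    # Eqn. (1): each vocabulary ingredient j on the item's list gets weight
    # 1/|Ingr{s_i}|; all other entries stay 0.
    g = np.zeros(len(vocab))
    for j, ing in enumerate(vocab):
        if ing in item_ingredients:
            g[j] = 1.0 / len(item_ingredients)
    return g

# Hypothetical two-ingredient dish against a toy 4-word vocabulary.
vocab = ["mozzarella", "beef broth", "egg", "tomato"]
print(ingredient_feature({"mozzarella", "egg"}, vocab))  # [0.5 0.  0.5 0. ]
```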

4. FOOD SIMILARITY EMBEDDING

4.1 Training the Deep Siamese Network

Recent advances in similarity metric learning with deep Convolutional Neural Networks (CNNs) have resulted in breakthroughs in areas including Face Verification [37], Image Retrieval [41], Geo-localization [24], and Product Design [3]. The CNN architectures in these works are based on the Siamese Network [7], which is trained on a large number of paired inputs of similar and dissimilar examples. In light of these prior efforts, we adopt this approach to generate a distance embedding for meals.

The proposed CNN architecture (Food-CNN) is illustrated in Fig. 2. The inputs of Food-CNN are a pair of color food images x, y ∈ S, each of size 227 × 227 × 3. Each image proceeds through an identical feature extraction CNN containing several layers, from Convolution to Batch Normalization. Finally, the outputs of the last layer are used as the low-dimensional feature embeddings f(x), f(y). The architecture from the first Convolution layer to the last Fully Connected layer (i.e., the layers in the dashed bounding box in Fig. 2) is the same as the architecture that achieved state-of-the-art image classification performance on ImageNet [22]. For each layer, the numbers at the top specify its window size and step size (Convolution and Max Pooling); the numbers at the bottom specify the size of its output feature map. For example, the first Convolution layer takes a 227 × 227 × 3 color image from the image data layer as input and convolves it with 96 filters. Each filter has a size of 11 × 11 × 3 and convolves the image on a grid with step size 4 × 4. Thus, the output of the first Convolution layer is a 55 × 55 feature map with 96 channels. We add a final Batch Normalization layer to normalize the 1,000 dim feature vector so that each dimension has zero mean and unit variance within a training batch. Batch Normalization provides a faster convergence rate and higher accuracy in practice [19].

Our goal for Food-CNN is to learn a low-dimensional feature embedding that pulls similar food items together and pushes dissimilar food items far apart. Specifically, we want f(x) and f(y) to have small distance (close to 0) if x and y are similar items; otherwise, they should have distance larger than a margin m.


[Figure 2: Food-CNN architecture for supervised food similarity distance metric learning. Two weight-sharing branches take Image 1 (x) and Image 2 (y); each branch runs Convolution 11×11 (stride 4×4), Max Pooling 3×3 (stride 2×2), Convolution 5×5, Max Pooling 3×3 (stride 2×2), three 3×3 Convolutions, and Max Pooling 3×3 (stride 2×2), producing feature maps of sizes 55×55×96, 27×27×96, 27×27×256, and 13×13×256 down to 6×6×256, followed by Fully Connected layers of sizes 4096, 4096, and 1000 and a final Batch Normalization layer. The outputs f(x), f(y) feed a Contrastive Loss with label l = 0 or 1.]

Figure 3: Contrastive Loss function with m = 1.

Therefore, we choose the Contrastive Loss proposed in [16] as the loss function to optimize Food-CNN, which can be expressed as:

L(x, y, l) = \frac{1}{2}\, l\, D^2 + \frac{1}{2}\,(1 - l)\,\max(0,\, m - D)^2    (2)

where the similarity label l ∈ {0, 1} indicates whether the input pair of food items x, y is similar (l = 1) or dissimilar (l = 0), m > 0 is the margin for dissimilar items, and D = ‖f(x) − f(y)‖₂ is the Euclidean distance between f(x) and f(y) in the embedding space.

As illustrated in Fig. 3, the Contrastive Loss function exactly matches our goal. Minimizing the loss in Eqn. (2) pulls similar food images closer and pushes dissimilar ones apart, as it penalizes similar pairs quadratically by their distances, and penalizes dissimilar pairs by the squared difference between their distances and the margin m whenever those distances are smaller than m.
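As a quick sanity check of Eqn. (2), here is a small NumPy sketch of the contrastive loss for a single pair; the toy vectors and the function name are illustrative assumptions, not the paper's code.

```python
import numpy as np

def contrastive_loss(fx, fy, l, m=1.0):
    # Eqn. (2): l = 1 for a similar pair, l = 0 for a dissimilar one.
    D = np.linalg.norm(fx - fy)                     # Euclidean distance D
    return 0.5 * l * D**2 + 0.5 * (1 - l) * max(0.0, m - D)**2

fx, fy = np.array([0.1, 0.9]), np.array([0.2, 0.7])
print(contrastive_loss(fx, fy, l=1))  # small loss: embeddings already close
print(contrastive_loss(fx, fy, l=0))  # penalized: pair is inside the margin
```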

Training a deep Siamese Network usually requires a huge amount of training data that can't fit in memory. To address this problem, we adopt Nesterov's Accelerated Gradient Descent method [26] with the Momentum algorithm [36]. We use back-propagation to compute the gradient of the loss with respect to the parameters of each layer. Suppose we have an n-layer CNN:

f(x) = g_n(g_{n-1}(\ldots g_i(\ldots g_1(x)\ldots)))    (3)

where g_i(·) represents the computation of the i-th layer (e.g., convolution, pooling, etc.), and x, y and f(x), f(y) represent the input and output pairs of the Siamese Network, respectively. We adopt the function L(·) to calculate the loss of the input pair x, y as L(x, y, l). To minimize the loss by updating the parameters W_i of the i-th layer, we need the gradient of the loss with respect to W_i: \frac{\partial L(\cdot)}{\partial W_i} = \frac{\partial L(\cdot)}{\partial g_i} \times \frac{\partial g_i}{\partial W_i}. Using back-propagation, \frac{\partial L(\cdot)}{\partial g_i} can be calculated with the chain rule: \frac{\partial L(\cdot)}{\partial g_i} = \frac{\partial L(\cdot)}{\partial g_n} \times \frac{\partial g_n(\cdot)}{\partial g_{n-1}} \times \cdots \times \frac{\partial g_{i+1}(\cdot)}{\partial g_i}. Therefore, for the k-th layer g_k(·), we only need to calculate \frac{\partial g_k(\cdot)}{\partial g_{k-1}} and \frac{\partial g_k(\cdot)}{\partial W_k}. We use the implementation of gradient descent and back-propagation in Caffe [20], an open source deep learning framework.

Since the food dataset S pulled from Yummly doesn't include categorical similarity annotations, we trained Food-CNN on the Food-101 dataset [4], which is the largest food image dataset to date and contains 101 food categories and 101,000 food images. We sampled 75,750 same pairs and 757,500 different pairs from the training set to train our Food-CNN.
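A minimal sketch of how such same/different training pairs could be drawn, assuming integer category labels; the function name and the rejection-sampling strategy are our illustration, not the authors' pipeline.

```python
import numpy as np

def sample_pairs(labels, n_same, n_diff, rng=np.random.default_rng(0)):
    # Rejection-sample (i, j, l) training triples: l = 1 iff the two images
    # share a category, l = 0 otherwise; similar pairs are filled first.
    pairs = []
    while len(pairs) < n_same + n_diff:
        i, j = rng.integers(len(labels), size=2)
        same = int(labels[i] == labels[j])
        want_same = len(pairs) < n_same
        if i != j and same == int(want_same):
            pairs.append((int(i), int(j), same))
    return pairs

labels = np.repeat(np.arange(101), 10)  # toy stand-in for Food-101 labels
print(sample_pairs(labels, 3, 3))
```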

4.2 Feature Extraction

After the training process, we use the pre-trained Food-CNN to extract visual features from the images in our dataset. For each item s_i ∈ S, we feed the food image through the pre-trained feature extraction layers (i.e., the layers in the blue dashed bounding box in Fig. 2) and obtain the raw feature vector \hat{v}^{s_i} = [\hat{v}^{s_i}_1, \ldots, \hat{v}^{s_i}_{d_{vis}}]^T. We normalize it to have unit ℓ₁ norm, v^{s_i}_j = \hat{v}^{s_i}_j / \sum_{k=1}^{d_{vis}} |\hat{v}^{s_i}_k|, where v^{s_i} = [v^{s_i}_1, \ldots, v^{s_i}_{d_{vis}}]^T is the normalized feature vector and d_vis = 1,000 is the dimension of the visual features.

We then concatenate the visual features v^{s_i} with the ingredient features g^{s_i} to create a 1,200 dim feature embedding f^{s_i} = [v^{s_i}, g^{s_i}] for each food item s_i ∈ S.
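The embedding step is a one-liner in NumPy; this sketch assumes a random vector stands in for the 1,000 dim Food-CNN output and a zero vector for the ingredient features.

```python
import numpy as np

def embed_item(raw_visual, g):
    # l1-normalize the 1,000 dim visual feature, then append the 200 dim
    # ingredient vector g from Eqn. (1) to get the 1,200 dim embedding f.
    v = raw_visual / np.abs(raw_visual).sum()
    return np.concatenate([v, g])

f = embed_item(np.random.randn(1000), np.zeros(200))
print(f.shape, np.abs(f[:1000]).sum())  # (1200,) 1.0
```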

5. ONLINE PREFERENCE LEARNING

5.1 Online Settings and Framework

As discussed in the previous sections, each food item s_i ∈ S has a 1,200 dim feature vector f^{s_i} in the embedding space. Building upon the offline feature extraction results, we model the interaction between the user and our backend system at iteration t (t = 1, 2, ..., T) as Fig. 4 shows. The symbols used in our algorithms are defined as follows:

K_t: the set of food items presented to the user at iteration t (K_0 = ∅); ∀k ∈ K_t, k ∈ S.

L_{t-1}: the set of food items that the user prefers (selects) among {k | k ∈ K_{t-1}}; L_{t-1} ⊆ K_{t-1}.

p^t = [p^t_1, ..., p^t_{|S|}]: the user's preference distribution over all food items s_i (i = 1, ..., |S|), where ‖p^t‖₁ = 1; p^0 is initialized as p^0_i = 1/|S|.

B_t: the set of food images already explored up to iteration t (B_0 = ∅); B_i ⊆ B_j for i < j.


Figure 4: User-system interaction at iteration t.

Based on the workflow depicted in Fig. 4, at each iteration t the backend system updates the vector p^{t-1} to p^t and the set B_{t-1} to B_t based on the user's selections L_{t-1} and the previous image set K_{t-1}. It then decides the set of images that will be presented to the user next (i.e., K_t). Our food preference elicitation framework is formalized in Algorithm 1. The core procedures are update and select, which are described in detail in the following subsections.

Algorithm 1: Food Preference Elicitation Framework

Data: S = {s_1, ..., s_10028}, F = {f^{s_1}, ..., f^{s_10028}}
Result: p^T
1  B_0 = ∅, K_0 = ∅, L_0 = ∅, p^0 = [1/|S|, ..., 1/|S|]
2  for t ← 1 to T do
3      [B_t, p^t] ← update(K_{t-1}, L_{t-1}, B_{t-1}, p^{t-1})
4      K_t ← select(t, B_t, p^t)
5      if t equals T then
6          return p^T
7      else
8          ShowToUser(K_t)
9          L_t ← WaitForSelection()

5.2 User State Update

Based on the user's selections L_{t-1} and the image set K_{t-1}, the update module renews the user's state from {B_{t-1}, p^{t-1}} to {B_t, p^t}. The intuition and assumption behind the following algorithm design is that people tend to have similar preferences for food items that are close in the 1,200 dim space.

5.2.1 Preference vector p^t

Our strategy for updating the preference vector p^t is inspired by the Exponentiated Gradient algorithm in bandit settings (EXP3) [2]. Specifically, at iteration t, each p^t_i in the vector p^t is updated by:

p^t_i \leftarrow p^{t-1}_i \, e^{\beta u^{t-1}_i / p^{t-1}_i}    (4)

where β is the exponentiated coefficient that controls the update speed and u^{t-1} = \{u^{t-1}_1, \ldots, u^{t-1}_{|S|}\} is the update vector used to adjust each preference value.

In order to calculate the update vector u, we formalize the user's selection process as a data labeling problem [44]: for s_i ∈ L_{t-1}, label y^{t-1}_i = 1, and for s_j ∈ K_{t-1} \ L_{t-1}, label y^{t-1}_j = −1. Thus, the label vector y^{t-1} = \{y^{t-1}_1, \ldots, y^{t-1}_{|S|}\} provided by the user is:

y^{t-1}_i = \begin{cases} 1 & : s_i \in L_{t-1} \\ 0 & : s_i \notin K_{t-1} \\ -1 & : s_i \in K_{t-1} \setminus L_{t-1} \end{cases}    (5)
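A tiny NumPy sketch of the multiplicative update in Eqn. (4), assuming the update vector u has already been produced by the propagation step described below; the β value is an arbitrary choice for illustration.

```python
import numpy as np

def exp3_update(p, u, beta=0.01):
    # Eqn. (4): multiplicative EXP3-style step, then renormalize ||p||_1 = 1.
    p = p * np.exp(beta * u / p)
    return p / p.sum()

p = np.full(5, 0.2)                       # toy corpus of five items
u = np.array([1.0, -1.0, 0.3, 0.0, 0.0])  # softened labels from propagation
print(exp3_update(p, u))                  # mass shifts toward liked items
```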

For the update vector u, we expect it to be close to the label vector y but with smooth propagation of label values to nearby neighbors (for convenience, we omit the superscript denoting the current iteration). The update vector u can be regarded as a softened label vector compared with y. To make the solution more computationally tractable, for each item s_i with y_i ≠ 0 we construct a locally connected undirected graph G_i as Fig. 5 shows: ∀s_j ∈ S, add an edge (s_i, s_j) if ‖f^{s_i} − f^{s_j}‖ ≤ δ (δ = 35 in our implementation). The labels y^i for the vertices of graph G_i are y^i_j = 0 for j ≠ i, and y^i_i = y_i.

[Figure 5: Locally connected graph around item s_i (y^i_i = 1 or −1): an edge connects s_i to s_b since ‖f^{s_i} − f^{s_b}‖ ≤ δ (y^i_b = 0), while s_a with ‖f^{s_i} − f^{s_a}‖ > δ is not connected (y^i_a = 0).]

For each locally connected graph G_i, we fix u^i_i = y^i_i and propose the following regularized optimization method, inspired by the traditional label propagation method [44], to compute the remaining elements u^i_j (j ≠ i) of the update vector u^i. Consider minimizing the objective function Q(u^i):

\min_{u^i} \; \sum_{j=1, j \neq i}^{|S|} w_{ij}\,(y^i_i - u^i_j)^2 + \sum_{j=1, j \neq i}^{|S|} (1 - w_{ij})\,(u^i_j - y^i_j)^2    (6)

In Eqn. (6), w_ij represents the similarity measure between food items s_i and s_j:

w_{ij} = \begin{cases} e^{-\frac{1}{2\alpha^2}\|f^{s_i} - f^{s_j}\|^2} & : \|f^{s_i} - f^{s_j}\| \leq \delta \\ 0 & : \|f^{s_i} - f^{s_j}\| > \delta \end{cases}    (7)

where \alpha^2 = \frac{1}{|S|^2}\sum_{i,j \in S}\|f^{s_i} - f^{s_j}\|^2.

The first term of the objective function Q(u^i) is the smoothness constraint: the update values for similar food items should not differ too much. The second term is the fitting constraint, which keeps u^i close to the initial labeling assigned by the user (i.e., y^i). Unlike [44], however, in our algorithm the trade-off between these two constraints is dynamically adjusted by the similarity between items s_i and s_j: similar pairs are weighted more toward smoothness, while dissimilar pairs are forced to stay close to the initial labeling.

With Eqn. (6) defined, we set the partial derivative of Q(u^i) with respect to each u^i_j to zero:

\frac{\partial Q(u^i)}{\partial u^i_j} = 2w_{ij}(u^i_j - u^i_i) + 2(1 - w_{ij})(u^i_j - y^i_j) = 0    (8)

Since u^i_i = y^i_i and y^i_j = 0 for j ≠ i, this yields:

u^i_j = w_{ij}\,u^i_i = w_{ij}\,y^i_i \quad (j = 1, 2, \ldots, |S|)    (9)

After all u^i are calculated, the overall update vector u is their sum, i.e., u = \sum_i u^i. The pseudocode for updating the preference vector is given in Algorithm 2.
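Putting Eqns. (7) and (9) together, the sketch below computes the aggregate update vector u for a toy embedding; the brute-force pairwise distance matrix is for illustration only (for the full 10,028-item corpus one would restrict computation to the δ-neighborhoods).

```python
import numpy as np

def update_vector(F, labeled, delta):
    # labeled: {item index -> y_i in {+1, -1}} for the items just shown.
    dists = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    alpha2 = (dists**2).mean()           # alpha^2: mean squared pairwise distance
    u = np.zeros(len(F))
    for i, y_i in labeled.items():
        w = np.exp(-dists[i]**2 / (2 * alpha2))
        w[dists[i] > delta] = 0.0        # Eqn. (7): zero outside the local graph
        u += w * y_i                     # Eqn. (9): u^i_j = w_ij * y^i_i (w_ii = 1)
    return u

F = np.random.randn(6, 4)                # toy embedding: 6 items, 4 dims
print(update_vector(F, {0: 1, 3: -1}, delta=5.0))
```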


5.2.2 Explored food image set Bt

In order to balance exploitation and exploration in the image selection phase, we maintain a set B_t that keeps track of all food items similar to those the user has already visited. The updating rule for B_t is:

B_t \leftarrow B_{t-1} \cup \{s_i \in S \mid \min_{s_j \in K_{t-1}}\|f^{s_i} - f^{s_j}\| \leq \delta\}    (10)
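A short sketch of the explored-set update in Eqn. (10), with the indices of the last shown images standing in for K_{t-1}.

```python
import numpy as np

def update_explored(B_prev, F, shown, delta):
    # Eqn. (10): mark every item within distance delta of a shown image.
    d_min = np.linalg.norm(F[:, None, :] - F[shown][None, :, :], axis=-1).min(axis=1)
    return B_prev | set(np.flatnonzero(d_min <= delta).tolist())

F = np.random.randn(6, 4)
print(update_explored(set(), F, shown=[0, 2], delta=2.0))
```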

With the algorithms for updating the preference vector p^t and the explored image set B_t in place, the overall functionality of the procedure update is shown in Algorithm 2.

Algorithm 2: User State Update Algorithm - update

1  Function update(K_{t-1}, L_{t-1}, B_{t-1}, p^{t-1})
       input : K_{t-1}, L_{t-1}, B_{t-1}, p^{t-1}
       output: B_t, p^t
2  u = [0, ..., 0], B_t = B_{t-1}, p^t = p^{t-1}
3  for i ← 1 to |S| do
4      // preference update
5      for s_j in K_{t-1} do
6          u_i ← u_i + (−1)^{1[s_j ∈ L_{t-1}] − 1} · w_ij
7      p^t_i = p^{t-1}_i · e^{β u_i / p^{t-1}_i}
8      // explored image set update
9      if min(‖f^{s_i} − f^{s_j}‖, ∀ s_j ∈ K_{t-1}) ≤ δ then
10         B_t ← B_t ∪ {s_i}
11 // normalize p^t s.t. ‖p^t‖₁ = 1
12 normalize(p^t)

Algorithm 3: K-means++ Algorithm for Exploration

1  Function k-means-pp(S, n)
       input : S, n
       output: K_t
2  K_t = {random(S)}
3  while |K_t| < n do
4      prob ← [0, ..., 0]_{|S|}
5      for i ← 1 to |S| do
6          prob_i ← min(‖f^{s_i} − f^{s_j}‖², ∀ s_j ∈ K_t)
7      sample s_m ∈ S with probability ∝ prob_m
8      K_t ← K_t ∪ {s_m}
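The seeding loop of Algorithm 3 translates almost line-for-line into NumPy; this sketch assumes the embedding matrix F and a fixed random seed for reproducibility.

```python
import numpy as np

def kmeans_pp(F, n, rng=np.random.default_rng(0)):
    # Algorithm 3: K-means++ seeding over the embedding matrix F (|S| x d).
    chosen = [int(rng.integers(len(F)))]          # first item uniformly at random
    while len(chosen) < n:
        # squared distance to the closest already-chosen item
        d2 = ((F[:, None, :] - F[chosen][None, :, :])**2).sum(-1).min(axis=1)
        chosen.append(int(rng.choice(len(F), p=d2 / d2.sum())))
    return chosen

F = np.random.randn(50, 4)
print(kmeans_pp(F, 10))                           # ten well-spread seed items
```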

5.3 Image Selection

After updating the user state, the select module picks the food images that will be presented in the next round. In order to trade off between exploration and exploitation, we propose different image selection strategies based on the current iteration t.

5.3.1 Food Exploration

For each of the first two iterations, we select ten different food images using the K-means++ [1] algorithm, a seeding method used in K-means clustering that can guarantee the selected items are evenly distributed in the feature space. For our use case, the K-means++ algorithm is summarized in Algorithm 3.

5.3.2 Food Exploitation-Exploration

Starting from the third iteration, users are asked to make pairwise comparisons between food images. To balance exploitation and exploration in our algorithm design, we always select one image from the area with high preference values based on the current p^t and the other from the unexplored area, i.e., S \ B_t (both selections are random within the given subset of food items). The resulting image selection method is shown in Algorithm 4.

Algorithm 4: Image Selection Algorithm - select

1  Function select(t, B_t, p^t)
       input : t, B_t, p^t
       output: K_t
2  K_t = ∅
3  if t ≤ 2 then
4      K_t ← k-means-pp(S, 10)  // K-means++
5  else
6      // 99th percentile (top 1%)
7      threshold ← percentile(p^t, 99)
8      topSet ← {s_i ∈ S | p^t_i ≥ threshold}
9      K_t ← [random(topSet), random(S \ B_t)]

6. EXPERIMENTAL EVALUATION

6.1 Food Similarity Embedding

We examine and evaluate the clustering performance of the Food-CNN model on the Food-101 [4] dataset, where each image is tagged with a categorical label. We first extract a 1,000 dim visual feature for each food image in the test set. Then, for each food image, we retrieve its k nearest neighbors, where k = 1, 2, ..., N (N is the size of the test set), and calculate the Precision and Recall values for each k:

Suppose N^i_k is the set of the k nearest neighbors of item i, which belongs to category C_i; then the Precision and Recall values are:

\text{Precision} = \frac{|\{j \mid j \in N^i_k \wedge j \in C_i\}|}{|N^i_k|}    (11)

\text{Recall} = \frac{|\{j \mid j \in N^i_k \wedge j \in C_i\}|}{|C_i|}    (12)

In order to measure the overall performance of our embedding method on the Food-101 test set, we average the Precision/Recall values over all food images for each method and plot the Precision-Recall curve (PR curve), as Fig. 6 shows. We use mean Average Precision (mAP), which corresponds to the area under the PR curve, as the quantitative comparison metric. The mAP value of an ideal algorithm equals 1.
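As a rough illustration of this evaluation, the sketch below computes the averaged Precision/Recall curve for k-nearest-neighbor retrieval in NumPy; the toy embedding F, the label array, and all variable names are our assumptions, not the paper's code.

```python
import numpy as np

def pr_curve(F, labels):
    # Eqns. (11)-(12): rank all other test images by distance, then average
    # Precision@k and Recall@k over every query image.
    N = len(F)
    d = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a query is not its own neighbor
    order = np.argsort(d, axis=1)[:, : N - 1]    # k = 1 .. N-1 nearest neighbors
    hits = np.cumsum(labels[order] == labels[:, None], axis=1)
    k = np.arange(1, N)
    class_size = np.array([(labels == c).sum() for c in labels])
    return (hits / k).mean(axis=0), (hits / class_size[:, None]).mean(axis=0)

F, labels = np.random.randn(40, 8), np.repeat(np.arange(4), 10)
precision, recall = pr_curve(F, labels)
print(np.trapz(precision, recall))               # mAP: area under the PR curve
```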

We compare our Food-CNN model with several state-of-the-art feature extraction methods: 1. Pretrained AlexNet deep neural network: a state-of-the-art feature extraction method using pretrained AlexNet [22]; we take the 1,000 dim feature representation from the output of the last fully connected layer. 2. SIFT + Bag of visual Words (BoW): as the most popular hand-crafted feature representation, SIFT [25] has been shown to be effective in several applications; we extract visual features using 1,000


[Figure 6: Precision-recall curves for food similarity embedding, with precision on a logarithmic scale (mAP: mean Average Precision, the area under each curve). Food-CNN: mAP 0.216; AlexNet: mAP 0.051; SIFT+BoW: mAP 0.019; Random Guess: mAP 0.01.]

Figure 7: Food Embedding Visualization.

words, so as to guarantee the same feature dimension as our Food-CNN.

As our results show, although SIFT+BoW and AlexNet greatly outperform the random-guess baseline, they lack discriminative power for food images because these models are designed for general-purpose recognition. With Food-CNN, we achieve a 4× improvement over state-of-the-art models in terms of mAP. An algorithm with much better clustering power helps the whole system understand visual content and thus improves the efficiency of online preference learning.

To further verify and visualize the generalization power of our system, we embed each recipe image collected from Yummly into the 1,200 dim feature space by first using Food-CNN to extract its 1,000 dim visual representation and then concatenating the 200 dim ingredient features. We project all image representations onto a 2-D plane using t-Distributed Stochastic Neighbor Embedding (t-SNE) [39]. As shown in Fig. 7, we divide the 2-D plane into several blocks and, for each block, sample a representative food image residing in that area. The final embedding results clearly show that our method effectively groups similar food items (recipes) together and pushes dissimilar items apart based on their visual appearance and ingredient metadata. For example, in Fig. 7, burgers, noodles, pizzas, and meat are grouped in different areas of the feature space.

6.2 Online user study

We conducted a field study among 227 anonymous users recruited from social networks and university mailing lists. The experiment was approved by the Institutional Review Board (ID: 1411005129) at Cornell University. All participants were required to use the system independently three times. Each time, the study consisted of the following two phases:

Training Phase. Users played with PlateClick for the first T iterations, and the system learned and elicited the preference vector p^T based on the algorithms proposed in this paper or the baseline methods discussed below. We randomly picked T from the set {5, 10, 15} at the beginning, but made sure that each user experienced each value of T only once.

Testing Phase. After T iterations of training, users entered the testing phase, which consisted of 10 rounds of pairwise comparisons. We picked testing images based on the preference vector p^T learned from the online interactions: one image was selected from the food area the user liked (i.e., items with top 1% preference values) and the other from the area the user disliked (i.e., items with bottom 1% preference values). Both images were picked randomly among unexplored food items.

6.2.1 Prediction accuracy

To assess the learning performance of our algorithm, we compare it with several combinatorial baselines, described next. Users encountered these online learning algorithms at random when they logged into the system:

LE+EE: The online learning algorithm proposed in this paper, combining Label propagation and the Exponentiated Gradient algorithm for user state update with the Exploitation-Exploration strategy for image selection.

LE+RS: Retains our method for user state update (LE) but Randomly Selects the images presented to the user, without any exploitation or exploration.

OP+EE: As each item is represented by a 1,200 dim feature vector, we can adopt a regression view of this online learning problem (i.e., learn a weight vector w such that ⟨w, f^{s_i}⟩ is higher for items s_i the user prefers). Hence, we compare our method with the Online Perceptron algorithm, which updates w whenever it makes an error, i.e., if y_i⟨w, f^{s_i}⟩ ≤ 0, assign w ← w + y_i f^{s_i}, where y_i is the label for item s_i (a pairwise comparison is treated as binary classification: the food item the user selects is labeled +1 and the other −1). This algorithm retains our image selection strategy (EE).

OP+RS: The Online Perceptron described above, but with the Random image Selection strategy.

Among the 227 participants in our study, 58 used algorithm LE+EE and 57 used OP+RS. Of the remaining 112 users, half (56) tested OP+EE and the other half (56) tested LE+RS. Participants were assigned to algorithms entirely at random, so the performances of the different models are directly comparable.

After all users went through the training and testing phases, we calculated the prediction accuracy of each individual user and aggregated the results based on the context they encountered (i.e., the number of training iterations T and the algorithm settings mentioned above). The prediction accuracies and their cumulative distributions are shown in Figs. 8, 9 and 10, respectively.

Length effects of training iterations. As can be seen in Fig. 8 and Fig. 9, the prediction accuracies of our online learning algorithm are all significantly higher than the baselines. The algorithm's performance improves further with a longer training period. As is clearly shown in Fig. 9, when



Figure 8: Prediction accuracy for different algorithms in various training settings (asterisks represent different levels of statistical significance: ∗∗∗ : p < 0.001, ∗∗ : p < 0.01, ∗ : p < 0.05).

the number of training iterations reaches 15, about half of the users experience prediction accuracy exceeding 80%, which is quite promising considering the small number of interactions the system elicits from scratch. These results show that PlateClick, as an online preference learning system, can adjust itself to explore a user's preference area as more information becomes available from their choices. For the task of item-based food preference bootstrapping, our system can efficiently balance exploration and exploitation while providing reasonably accurate predictions.

Comparison across different algorithms. As mentioned previously, we compared our algorithm with some obvious alternatives. Unfortunately, according to the results shown in Fig. 8 and Fig. 10, none of these algorithms works very well, and the prediction accuracy actually decreases as the user provides more information. Additionally, as shown in Fig. 10, our algorithm has a particular advantage as users make progress (i.e., as the number of training iterations reaches 15). The reasons these techniques are not suited to our application come down to the following limitations:

Random Selection. Within a limited number of interactions, random selection can neither maintain the knowledge it has already learned about the user (exploitation) nor explore unknown areas (exploration). In addition, the system is more likely to choose food items that are very similar to each other and thus hard for the user to decide between. Therefore, after short periods of interaction, the learned state becomes confused and performance degrades.

Underfitting. The algorithm most likely to suffer from underfitting is the online perceptron (OP). In our application, each food item is represented by a 1,200 dim feature vector, and OP tries to learn a separating hyperplane from a limited amount of training data. As each feature is derived directly from a deep neural network, the linearity assumption made by the perceptron may yield wrong predictions for dishes that haven't been explored before.

6.2.2 System efficiency

Computing efficiency and user experience are two further important aspects of an online preference elicitation system. Therefore, we recorded the program execution time and user response

Figure 9: Cumulative distribution of prediction accuracy for the LE+EE algorithm (numbers in the legend represent the values of T used in the training phase).

time as a lens into the real-time performance of PlateClick. As shown in Fig. 11(b), the program execution time is about 0.35 s for each of the first two iterations and less than 0.025 s for each iteration afterwards³. Also, according to Fig. 11(a), the majority of users can make their decisions in less than 15 s for the task of comparing ten food images, while the time needed for a pairwise comparison is less than 2–3 s. As a final cumulative metric for system overhead, Table 2 shows that even with 15 iterations of training, users typically complete the whole process within 53 seconds, which further demonstrates that PlateClick is a light-weight, user-friendly visual interface for efficient food preference elicitation.

# Iter: 5    # Iter: 10    # Iter: 15
 28.75 s      39.74 s       53.22 s

Table 2: Average time to complete training phase.

6.2.3 User Behavior

An interesting phenomenon we observed in our field study is a clear correlation between total user response time and the precision of our model: PlateClick's preference learning tends to be more accurate when the user makes decisions within a shorter period of time. Fig. 12 shows a scatter diagram containing all data points generated by users under the LE+EE algorithm setting with 15 training iterations (T = 15). Most of the points (the blue ones) lie above the dotted line in Fig. 12, and the expected total response time is higher for users with lower prediction accuracy. A possible explanation is that when food pairs are hard to distinguish from each other, the user's responses are likely to carry noise and uncertainty,⁴ which in turn affects the performance of the online learning system.

7. DISCUSSION

In this section we discuss limitations and future directions for PlateClick.

³Our web system implementation is based on an Amazon EC2 t2.micro Linux 64-bit instance.
⁴User engagement is a critical issue in our application. We will conduct studies/interviews on the effects of human behavior in the future.


Figure 10: Comparison of the cumulative distribution of prediction accuracy across algorithms: (a) 5, (b) 10, and (c) 15 training iterations.

Figure 11: Timestamp records for (a) user response and (b) system execution.

Figure 12: Scatter diagram of total user response time versus model prediction precision (points represent experimental results under the LE+EE algorithm setting with 15 training iterations).

Special diets. We noticed in the user study that our current system cannot be used efficiently by people with special diets (e.g., vegetarian, vegan, kosher, and halal) due to the limits of our image corpus. It is very unlikely that the system will select anything such a user would like from the current main-dishes dataset, and repeatedly showing dishes that are not to their taste degrades the user experience. In the future, we need to consider different diets and curate special food datasets accordingly.

Food courses and diversity. People's food preferences are highly dependent on context as determined by factors such as time of day. It is unnecessarily difficult and possibly confounding to compare dishes across different courses (e.g., between desserts and dinner). In our current system, we mainly focus on food items from main dishes, which conveniently constrains the space of choices available to the user. Future work will incorporate dishes across different courses (e.g., breakfast and lunch) and enrich the diversity of our dataset.

Healthy recommendation. One of the most important applications that could build on PlateClick is a healthy food recommendation system. Combining a user's diet profile and recipe nutrient metadata, we could recommend appealing but healthier food alternatives, so that users are more likely to follow the system's advice in their daily lives and choices. Suggestions guided by people's preferences will be more effective and persuasive than a traditional strategy that lacks awareness of what a given person actually likes. This was our initial motivation for developing PlateClick, and we hope to pursue integration with existing nutritional behavior change apps.

8. CONCLUSION

In this work, we introduced PlateClick, a novel visual interface for real-time food preference elicitation from scratch. In contrast to previous solutions and online learning systems, we don't assume any prior knowledge of the user. In addition, we greatly simplify the elicitation user interface


by replacing traditional text-based instruments with visual content and leveraging a pairwise comparison method. We demonstrated that these design choices reduce user burden and cognitive load when using the elicitation system.

Although our algorithmic framework was originally designed for visual-content-based food preference learning, the techniques proposed in this paper could be used to enhance the interplay between human hedonic responses and content similarities in solving general human-in-the-loop problems. We envision that PlateClick, an efficient and user-friendly food preference learning system, could be used to capture personal diet profiles and fuel a wide range of applications in healthcare and commercial recommender systems.

9. ACKNOWLEDGMENTS

We would like to thank Dr. Curtis Cole for suggesting the development of food preference modeling and Dr. Thorsten Joachims for discussions of ML algorithms. We also thank the anonymous reviewers for their insightful comments. This research is partly funded by the AOL Program for Connected Experiences and NSF grant IIS-1344587, and is further supported by the Small Data Lab at Cornell Tech, which receives funding from UnitedHealth Group, Google, Pfizer, RWJF, NIH, and NSF.

10. REFERENCES

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.
[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
[3] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. ACM Trans. on Graphics, 2015.
[4] L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
[5] S. Chang, F. M. Harper, and L. Terveen. Using groups of items for preference elicitation in recommender systems. In CSCW, 2015.
[6] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In WSDM, 2013.
[7] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[8] V. Conitzer. Eliciting single-peaked preferences using comparison queries. In AAMAS, 2007.
[9] F. Cordeiro, E. Bales, E. Cherry, and J. Fogarty. Rethinking the mobile food journal: Exploring opportunities for lightweight photo-based capture. In CHI, 2015.
[10] M. Das, G. De Francisci Morales, A. Gionis, and I. Weber. Learning to question: Leveraging user preferences for shopping advice. In KDD, 2013.
[11] D. Estrin. Small data, where n = me. Commun. ACM, 57(4):32–34, Apr. 2014.
[12] J. Freyne and S. Berkovsky. Intelligent food planning: Personalized recipe recommendation. In ACM IUI, 2010.
[13] G. Geleijnse, P. Nachtigall, P. van Kaam, and L. Wijgergangs. A personalized recipe advice system to promote healthful choices. In ACM IUI, 2011.
[14] N. Golbandi, Y. Koren, and R. Lempel. Adaptive bootstrapping of recommender systems using decision trees. In WSDM, 2011.
[15] S. Guo and S. Sanner. Real-time multiattribute bayesian preference elicitation with pairwise comparison queries. In AISTATS, 2010.
[16] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[17] M. Harvey, B. Ludwig, and D. Elsweiler. Learning user tastes: A first step to generating healthy meal plans? In International Workshop on Recommendation Technologies for Lifestyle Change, 2012.
[18] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, 1999.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[21] Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In KDD, 2008.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[23] J. Li and W. Zhang. Conditional restricted boltzmann machines for cold start recommendations. arXiv preprint arXiv:1408.0096, 2014.
[24] T.-Y. Lin, Y. Cui, S. Belongie, and J. Hays. Learning deep representations for ground-to-aerial geolocalization. In CVPR, 2015.
[25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[26] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, 1983.
[27] J. Noronha, E. Hysen, H. Zhang, and K. Z. Gajos. Platemate: Crowdsourcing nutritional analysis from food photographs. In ACM UIST, 2011.
[28] J. P. Pollak, P. Adams, and G. Gay. PAM: A photographic affect meter for frequent, in situ measurement of affect. In CHI, 2011.
[29] P. Pu and L. Chen. User-involved preference elicitation for product search and recommender systems. AI Magazine, 2009.
[30] A. M. Rashid, I. Albert, D. Cosley, S. K. Lam, S. M. McNee, J. A. Konstan, and J. Riedl. Getting to know you: Learning new user preferences in recommender systems. In ACM IUI, 2002.
[31] J. D. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.
[32] P. Rozin, C. Fischler, S. Imada, A. Sarubin, and A. Wrzesniewski. Attitudes to food and the role of food in life in the USA, Japan, Flemish Belgium and France: Possible implications for the diet–health debate. Appetite, 1999.
[33] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, 2001.
[34] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In SIGIR, 2002.
[35] M. Sun, F. Li, J. Lee, K. Zhou, G. Lebanon, and H. Zha. Learning multiple-question decision trees for cold-start recommendation. In WSDM, 2013.
[36] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[37] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[38] C.-Y. Teng, Y.-R. Lin, and L. A. Adamic. Recipe recommendation using ingredient networks. In Annual ACM Web Science Conference, 2012.
[39] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 2008.
[40] Y. van Pinxteren, G. Geleijnse, and P. Kamsteeg. Deriving a recipe similarity measure for recommending healthful meals. In ACM IUI, 2011.
[41] J. Wang, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, et al. Learning fine-grained image similarity with deep ranking. arXiv preprint arXiv:1404.4661, 2014.
[42] M. Zhang, J. Tang, X. Zhang, and X. Xue. Addressing cold start in recommender systems: A semi-supervised co-training algorithm. In SIGIR, 2014.
[43] X. Zhang, J. Cheng, S. Qiu, G. Zhu, and H. Lu. DualDS: A dual discriminative rating elicitation framework for cold start recommendation. Knowledge-Based Systems, 2015.
[44] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf. Learning with local and global consistency. NIPS, 2004.
[45] K. Zhou, S.-H. Yang, and H. Zha. Functional matrix factorizations for cold-start recommendation. In SIGIR, 2011.