2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
A Deep Multi-Modal Pairwise Ranking Model for User Generated Food Data
Hesam Salehian, Surender Yerva, Iman Barjasteh, Patrick Howell and Chul Lee
Under Armour Inc., 135 Townsend St, San Francisco, CA, USA
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—Due to the emergence of several nutrition-related mobile applications and websites in recent years, as well as the massive amount of crowd-sourced nutrition data, searching and finding relevant results has become increasingly difficult for users. This problem becomes even more challenging when dealing with crowd-sourced food names that are noisy and not well-structured. Because food names are short in length, it is difficult to apply existing methods and achieve optimal matching quality. Despite several recent studies on nutrition data, these challenges remain. In this paper, we propose a novel learning-to-rank framework for crowd-sourced food names that has significant real-world applications, including food search and food recommendations. In particular, we propose a deep-learning-based, multi-modal learning-to-rank model that leverages the text describing a food name and the numerical values that represent its nutritional information. To this end, we also introduce a novel type of loss function, which extends the standard triplet hinge loss function to a multi-modal scenario. The proposed model is flexible and supports various data types as well as an arbitrary number of modalities. The effectiveness of our proposed model is demonstrated through several experiments on real data consisting of more than six million instances.
I. INTRODUCTION
MYFITNESSPAL (MFP) is a free health and fitness app
available in Android, iOS, and web formats that helps
people set and achieve personalized health goals through the
tracking of their nutrition and physical activity. In fact, MFP,
used by 100 million+ users, is consistently ranked at the top
of the health and fitness category in both the Apple App Store
and in Google Play. By tracking health and nutrition, MFP
enables users to gain insights that help them make smarter
choices and build healthier habits. Upon setting a personal
fitness goal, users can visually inspect their fitness and weight
loss progress and receive insights and guidance that may help
them reach their health goals. MFP also operates its own social
network, social feeds, blogging platform, and user forums.
MFP’s food data – namely, nutritional contents and the food
descriptions – are one of the main draws of the app, and are
sourced almost entirely via user inputs. While MFP's database
carries no guarantees of nutritional accuracy due to its reliance
on crowd-sourcing, the great popularity of the app partially
speaks to the quality of its food DB, which consists of
hundreds of millions of food items aggregated over the course of
many years, and tens of billions of individual food log entries.
One crucial component for unlocking such a large database
of 100 million+ foods, is the ability to search it for relevant
results; in our case, a user inputs a text string and gets back a
list of food objects that contain food brand, food description
and nutritional information.
One natural problem that arises during food search is how
to retrieve the most relevant food objects given the text string
a user has entered. As an example, when a user inputs "orange" as
the query, the result set contains a wide range of food entities
containing the term "orange", including fruits, juices and desserts,
with different nutritional information. A goal then is to increase
the relevance of search results by factoring in other available
contexts, or signals learned from real user behavior. Note
that this sort of problem has been widely studied in the field
of learning to rank in various search ranking scenarios. Our
particular learning-to-rank problem differs from previously
studied instances and poses some unique challenges that
have not been seen before, mainly due to the peculiar
nature of our food data. As mentioned before, our food data
is collected via crowd-sourcing from over a hundred million
users over the course of many years. It contains text describing
each food name, and a real-valued vector that corresponds to
the nutrients. As a consequence, our food data is not fully
structured, sanitized, or well organized, making our ranking
problem far from straightforward. Furthermore, food names are
short in length, hence the presence or absence of a single word, or
the word ordering in a given food name, can significantly
distort its semantics, making our ranking problem harder than
a traditional learning-to-rank setting.
To illustrate some of the challenges in our ranking problem,
some examples are provided in Table I. In the first two
examples, the presence of the words "pie" and "sorbet", makes
the corresponding foods irrelevant to the specified queries
"apple" and "orange", respectively, while "Fuji apple" and
"large orange" are probably more semantically relevant. The
third example is more complicated in the sense that the relevant
and irrelevant food names are comprised of exactly the
same set of words, but subtle differences in word ordering
make "spaghetti with meat sauce" relevant, and "spaghetti
sauce with meat" irrelevant, to this particular query.

Table I. Sample food search results for popular queries (calorie values are 1-gram normalized).

Query String | Relevant Food Name | Relevant Food Calories | Irrelevant Food Name | Irrelevant Food Calories
apple | Fuji - Apple | 0.50 | Pie - Apple | 2.37
orange | Orange - Large | 0.47 | Orange - Sorbet | 1.32
spaghetti | Spaghetti with meat sauce | 1.57 | Spaghetti sauce with meat | 3.62

It should
be noted that in all three examples, the nutritional contents
can provide an important contextual clue to make the correct
prediction of user intent.
Given this observation, to overcome the complexities of food
naming conventions in text, we decided to exploit the nutrition
information of our food objects and incorporate it into our
ranking framework. For instance, for the query "apple",
the foods "Fuji - apple" and "pie - apple" may be evaluated
as similar in name, but they are very different in nutritional
content (0.5 and 2.37 calories per gram, respectively).
Therefore, an effective ranking model should take the nutrition
information into account, along with the text's semantic
features, in an intelligent manner.
To this end, we propose a new multi-modal deep learning-to-rank framework that takes different modalities, such as nutrition information and food names, into account. Our deep multi-modal approach is novel in that it works well for a pair-wise learning-to-rank problem, whereas many similar models proposed previously have mainly addressed point-wise ranking problems due to expensive computation needs. As part of our model, we propose a novel triplet loss function that is suitable for our pair-wise multi-modal ranking setting, but may also prove relevant to other forms of multi-modal learning-to-rank scenarios. The benefits of the flexibility of this triplet loss function can be observed in a few different ways. First, it can be easily extended to handle more than two modalities by simply adding more terms to the given loss function. Second, we believe that our novel multi-modal model can be used in other problem scenarios, including but not limited to text and image data, where a unified loss function is needed to combine vectors extracted from different modalities. Third, our novel objective function allows us to use embedded vectors of different sizes, with different geometric properties, from different modalities, while existing models simply assume almost the same vector lengths for the different data parts and concatenate them into a single feature vector. This is crucial for cases where some modalities are naturally more complicated than others; e.g., the nutrition vector in our problem has only four real-valued components, whereas the complexity of text data demands much larger embedding vector sizes. Therefore, simply concatenating the nutrition vector with the (much larger) embedded text vector would make the small-dimensional nutrition modality negligible in the final prediction. Finally, the proposed loss function also enables the researcher to use different distance functions for different modalities, unlike existing models where all embedded vectors are concatenated and treated as lying in the same space prior to the distance computation.
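To make this dilution effect concrete, here is a tiny numeric sketch (the 256-dimensional text width is an illustrative assumption, not the paper's actual embedding size):

```python
import numpy as np

rng = np.random.default_rng(0)
text_vec = rng.normal(size=256)  # hypothetical text embedding width
nut_vec = rng.normal(size=4)     # the 4-dimensional nutrition vector

# After concatenation, the nutrition coordinates contribute only a tiny
# fraction of the squared L2 norm (and hence of any L2-based distance),
# so the nutrition signal is effectively drowned out.
concat = np.concatenate([text_vec, nut_vec])
print(np.sum(nut_vec ** 2) / np.sum(concat ** 2))  # roughly 4/260 in expectation
```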
II. RELATED WORK
Classic learning-to-rank techniques often compute low-level
features to represent text data, including but not limited to
TF-IDF and BM25 [1], [2]. The ranking function is then learned
through standard machine learning techniques. The success of
these models relies heavily on the choice of features, which
often means that they require extensive feature engineering.
The advent of deep learning [3] has significantly empowered
automated and intelligent feature extraction, when sufficiently
large training data is available. Convolutional Neural Networks
(CNNs), originally proposed to extract image features [4],
have attracted enormous attention in recent years to address
ranking problems. In [5], [6], [7], CNN models, which are
good at spatial invariance, are employed to represent query and
candidate images, and the ranking is learned via pair-wise or
point-wise techniques. Closer to our domain, CNNs, specifically
1D convolution models, have also been used to learn text
features. In these settings, individual words or characters are
first represented as vectors, then form a matrix representing the
entire text. CNNs have been used for text matching/ranking,
in conjunction with word-level [8], [9], [10], [11], or with
character-level embeddings [12].
Even more recently, deep Recurrent Neural Networks (RNNs)
have shown better performance than CNNs when applied to
text data. This is mainly because text data is usually sequential
in nature, and an RNN can better model temporal dependencies
[13], [14]. In particular, Long Short-Term Memory (LSTM)
models are among the most commonly used types of RNNs, and
have been successfully applied to text ranking [15], machine
translation [16] and language modeling [17]. Most of the
existing learning-to-rank techniques consider
only one type of data, e.g., image or text. In many applications,
it is desired to take data from different modalities into account
to make more accurate predictions. There are several deep
learning models presented in the literature to learn semantics
from different data modalities. In most of these techniques,
image and text features are embedded into the same embedding
space via CNN or RNN models, respectively, based on their
semantic relations. In the recent literature, this multi-modal
feature embedding approach has been used to address the image
multi-labelling problem [18], [19] and image captioning [20],
[14].
However, very few works have addressed the problem of
multi-modal ranking. In [15] a pair-wise multi-modal ranking
model is learned on image and text data. The ranking framework
is a means to learn semantic relations between image and
text instances, while the ultimate goal is to generate relevant
captions for a given image, hence, like most of the existing
techniques, the image and text embedded vectors are mapped to
the same space. Consequently, the model in [15] is not directly
applicable to our multi-modal data consisting of nutrition
and text, because: (1) given a query text and a set of multi-modal
candidates, we aim to learn the best ranking, and direct
estimation of the query's nutritional contents is not desired,
(2) the semantic relation between text and nutrition data is
also not one-to-one, e.g., very different foods might still have
similar, or in some cases identical, nutritional contents. From
this perspective, the model in [21] seems more relevant to
our work, where a pair-wise multi-modal ranking is learned
on image and text data. In this model, the image and text
embedding spaces are separated, which is closer to our use
case, but our model is still fundamentally different from various
standpoints: (1) in [21] the text data is embedded using
bag-of-words features, while our model uses LSTM networks to
embed food names, which provides more intelligence and
requires very little pre-processing; (2) in [21] the text and
image embedded vectors are concatenated, before feeding into
the ranking layer, while we keep our two modalities separated
all the way through, which is more theoretically sound. To
this end, we propose a novel triplet hinge loss function
which takes embedded vectors from multiple modalities. The
theoretical benefits of our approach are discussed in more detail
in the next section.
III. OUR MODEL
In this section, we describe our proposed deep learning
model to rank the matching of candidate pairs in a multi-
modal fashion. Similar to other pairwise ranking methods, we
are given a query text and two candidate foods, and we aim
to rank one candidate higher than the other based on match
quality. Each food candidate in our model is represented as a
multi-modal entity, consisting of (1) a text component and (2)
a nutrition component. The aim of our model is to determine the
matching relevance by taking both modalities into consideration.
A. Distance Function for Nutrition Vectors
Nutritional content is an essential data modality when it comes
to comparing food objects. Nutritional content can
be broadly divided into two categories: macro-nutrients and
micro-nutrients. We first introduce a novel geometric model to
represent nutritional content vectors. For simplicity and without
any loss of generality, we consider the 1 × 4 macro-nutrient
vector (consisting of total energy, fat, carbs and protein) as
the vector representation of nutritional content. To handle food
objects with different serving sizes properly,
we normalize each macro-nutrient per 1 gram, for each food.
The 1 × 4 vector of macro-nutrients, represented henceforth by $[e, f, c, p]$, corresponds to total energy (measured in calories), fat, carbs and protein, respectively. This vector satisfies the well-known constraint $e = 9f + 4c + 4p$ [22]. Hence, the contribution of each macro-nutrient towards the total energy can be measured by $f' = 9f/e$, $c' = 4c/e$, $p' = 4p/e$, so that $f' + c' + p' = 1$. Any nutritional content vector $[e, f, c, p]$ can thus be decomposed into two components: (1) the total energy $e$, and (2) the normalized vector of macros $[f', c', p']$. Note that total energy is a positive value, i.e., $e \in \mathbb{R}^+$, while the square-root density vector [23], i.e., $M = [\sqrt{f'}, \sqrt{c'}, \sqrt{p'}]$, belongs to the two-dimensional sphere $S^2$, since $\sum_{i=1}^{3} M_i^2 = 1$. Thus, any nutritional content vector $[e, f, c, p]$ can be parameterized as $[e] \times [\sqrt{f'}, \sqrt{c'}, \sqrt{p'}]$, belonging to the product space $\mathbb{R}^+ \times S^2$. Given two nutritional content vectors $N_1 = [e_1, f_1, c_1, p_1]$ and $N_2 = [e_2, f_2, c_2, p_2]$, the intrinsic distance function on this product space can be computed as $\mathrm{dist}^2_{nut}(N_1, N_2) = \mathrm{dist}^2_{\mathbb{R}^+}(e_1, e_2) + \mathrm{dist}^2_{S^2}(M_1, M_2)$, where $M_i = [\sqrt{9 f_i / e_i}, \sqrt{4 c_i / e_i}, \sqrt{4 p_i / e_i}]$, $i = 1, 2$ [24]. The second term corresponds to the intrinsic distance function on the sphere, which is computed as $\mathrm{dist}_{S^2}(M_1, M_2) = \cos^{-1}(\langle M_1, M_2 \rangle)$, where $\langle \cdot, \cdot \rangle$ is the vector inner product operator [23]. Note that $\mathbb{R}^+$ is equivalent to the space of $1 \times 1$ Symmetric Positive Definite (SPD) matrices, so its intrinsic distance is defined as $\mathrm{dist}^2_{\mathbb{R}^+}(e_1, e_2) = (\log(e_1 / e_2))^2$ [25]. In summary, given $N_i$, $M_i$ and $e_i$ defined as above, we have

$$\mathrm{dist}^2_{nut}(N_1, N_2) = \left[\cos^{-1}(\langle M_1, M_2 \rangle)\right]^2 + \left[\log\!\left(\frac{e_1}{e_2}\right)\right]^2 \quad (1)$$
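For concreteness, a minimal NumPy sketch of Eq. 1 follows. The renormalization inside sqrt_density (dividing by 9f + 4c + 4p rather than by the reported e) is our own safeguard for crowd-sourced entries that slightly violate the energy constraint; the paper does not specify such handling.

```python
import numpy as np

def sqrt_density(n):
    """Map [e, f, c, p] to M = [sqrt(f'), sqrt(c'), sqrt(p')] on the sphere S^2.
    Normalizing by 9f + 4c + 4p (the macro-derived energy) keeps M exactly
    unit-norm even for entries that slightly violate e = 9f + 4c + 4p;
    this safeguard is our assumption, not the paper's."""
    _, f, c, p = n
    shares = np.array([9.0 * f, 4.0 * c, 4.0 * p])
    return np.sqrt(shares / shares.sum())

def dist2_nut(n1, n2):
    """Squared product-space distance of Eq. 1: a spherical term on the
    macro shares plus a log-ratio term on the total energies."""
    m1, m2 = sqrt_density(n1), sqrt_density(n2)
    sphere_term = np.arccos(np.clip(m1 @ m2, -1.0, 1.0)) ** 2
    energy_term = np.log(n1[0] / n2[0]) ** 2
    return float(sphere_term + energy_term)

# Fuji apple vs. apple strudel, 1-gram-normalized vectors as in Table IV:
print(dist2_nut([0.52, 0.01, 0.14, 0.01], [2.74, 0.11, 0.42, 0.03]))
```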
B. Multi-Modal Triplet Hinge Loss

The idea of using a triplet hinge loss function for pairwise
ranking problems is not novel [26], [27]. However, we find
that a direct application of a typical triplet hinge loss function
to our problem setting is not feasible, since our modalities
may require different embeddings. Thus, our proposed solution
is designed so that the loss function is capable of
taking multiple types of data into account while preserving
their individual geometric properties. A starting point is to first
look at the classic single-modal hinge loss function, which can
be used for a simple text-only model. More specifically, given
a query text $Q$ and a pair of candidates $P$ and $N$, labeled as
positive and negative respectively, the hinge loss is defined as:
$$L(Q, P, N) = \max\{0,\; \gamma + \mathrm{dist}^2_{txt}(f_q(Q), f(P)) - \mathrm{dist}^2_{txt}(f_q(Q), f(N))\} \quad (2)$$
where $\gamma$ is the gap parameter, which governs the separation level between positive and negative instances. $f_q(\cdot)$ and $f(\cdot)$ are text embedding functions for the query and the candidates, respectively, which transform text inputs to their respective $m$-dimensional feature vectors (i.e., $f_q(Q), f(P), f(N) \in \mathbb{R}^m$). In theory, the query embedding function ($f_q$) can be different from the candidate function ($f$), but in practice they are set to be identical. Furthermore, $\mathrm{dist}_{txt}(\cdot)$ is the distance function defined on the text embedding space; either the Euclidean ($L_2$) distance or the Manhattan ($L_1$) distance are standard choices that have been used in many practical applications.
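As a reference point, here is a minimal sketch of the single-modal loss in Eq. 2, assuming a squared-L2 text distance (the embeddings are plain arrays here; producing them is the network's job):

```python
import numpy as np

def dist2_txt(a, b):
    # Squared Euclidean (L2) distance on the text embedding space.
    return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def triplet_hinge_loss(fq_Q, f_P, f_N, gamma=1.0):
    """Eq. 2: push the positive candidate at least `gamma` closer
    to the query than the negative candidate."""
    return max(0.0, gamma + dist2_txt(fq_Q, f_P) - dist2_txt(fq_Q, f_N))
```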
We start by illustrating the limitation of existing models when
we have to embed two modalities with different geometrical
spaces. The authors in [26] applied the hinge loss function
in Eq. 2 to the problem of learning image similarity, where
the non-linear embedding function $f(\cdot)$ is learned in a deep
learning framework. Their model is somewhat limited, since it
is only applied to a single modality. In [27] a unified framework
is proposed to learn on both image and text embeddings,
using a hinge loss function similar to that of [26]. In their
model, though, the original distance function is replaced
with cosine similarity by swapping the signs of the positive and
negative terms (i.e., Eq. (6) in [27]). Although their model
takes into account two different modalities (i.e. image and
text), it simplifies the embedding of image and text modalities
by mapping them to the same space. Thus, the loss function
used in their model is still not applicable for two wholly
separated modalities that correspond to distinct embedding
spaces like ours. For their image caption generation problem,
where direct comparison between image and text is necessary,
it is natural to expect that using only one type of embedding
would be sufficient. In our application scenario, using only one
type of embedding would potentially limit or hinder important
information conveyed in one or the other dimension. Therefore
our proposed model aims to learn the relevance of the given
food candidate to an input query text, taking into account some
additional signals like nutritional content to improve the overall
accuracy of match ranking. Unlike the image caption generation
case, the mapping between food names and nutritional contents
is not one-to-one, given that similar nutrition vectors can be
mapped to very distinct names due to semantic differences.
We now present a novel triplet hinge loss function which extends Eq. 2 to the multi-modal case. Let $Q$ still represent the input query text, and let $P = \{P_{txt}, P_{nut}\}$ and $N = \{N_{txt}, N_{nut}\}$ be the multi-modal positive and negative food candidates, respectively, where $P_{txt}$ and $N_{txt}$ represent the text information, and $P_{nut}$ and $N_{nut}$ are the associated nutrition components. The text and nutrition vectors are of lengths $m$ and $n$, respectively. In order to compare foods of different quantities without any loss of generality, each nutrition vector contains the calorie, fat, carb and protein amounts per 1 gram of the given food, hence $n = 4$. For example, for butter we have $P_{nut} = [7.35, 0.8, 0, 0]$ (i.e., 7.35 calories/gram and 0.8 fat/gram), while for banana we have $P_{nut} = [0.89, 0.003, 0.23, 0.011]$ (i.e., 0.89 calories/gram, 0.003 fat/gram, 0.23 carbs/gram and 0.011 protein/gram). Let $f(\cdot)$ ($f_q(\cdot)$ for the query input) be the unknown embedding function that transforms the given input text into an $m$-dimensional space, and let $g(\cdot)$ be the unknown embedding function that transforms the given input query text into an $n$-dimensional space representing the learned nutritional content space. Therefore, $g(Q)$, $P_{nut}$ and $N_{nut}$ are all nutritional content vectors in this embedded space $\mathbb{R}^n$. Formally, our pair-wise multi-modal ranking problem can now be formulated using the following three pairs: $(f_q(Q), g(Q))$, $(f(P_{txt}), P_{nut})$ and $(f(N_{txt}), N_{nut})$. It was shown in the previous section that the $1 \times 4$ nutrition vector belongs to the product space $\mathbb{R}^+ \times S^2$. Hence, each pair of text and nutrition $(T, N)$ is a vector in the product space $\mathbb{R}^m \times \mathbb{R}^+ \times S^2$. Accordingly, the distance function in this product space is defined as [24] $\mathrm{dist}^2((T_1, N_1), (T_2, N_2)) = \mathrm{dist}^2_{txt}(T_1, T_2) + \mathrm{dist}^2_{nut}(N_1, N_2)$, where $(T_i, N_i)$, $i = 1, 2$, are two vectors in the product space of text and nutrition content, while $\mathrm{dist}_{txt}$ and $\mathrm{dist}_{nut}$ are the distance functions defined on the text and nutrition spaces, respectively.
Note that the linearity of our distance function allows it to be decomposed into a text-based component and a nutrition-based component. More importantly, we can use a different distance metric for each component (e.g., the $L_2$ distance for the text part, and Eq. 1 for the nutrition part). Such flexibility with the distance function is important in practice, since the separability of two objects with respect to one modality may not necessarily be suitable with respect to another modality. This is in contrast with other similar work [28], where all embedded vectors $a$ and $b$ are simply concatenated to form a flattened vector $[a_1, \ldots, a_j, b_1, \ldots, b_k]$ in the higher-dimensional space of size $j + k$. Furthermore, this multi-modal distance function enables us to use the intrinsic distance corresponding to each data modality. For instance, the nutrition vectors naturally belong to $\mathbb{R}^+ \times S^2$, where the intrinsic distance is defined in Eq. 1, while simple concatenation of vectors would force us to use a single, non-intrinsic distance function across all modalities, which does not preserve the geometric properties of the data. Using the distance function on the product space, we can now define the triplet hinge loss function in our multi-modal model as:
$$L(Q, P_{txt}, N_{txt}, P_{nut}, N_{nut}) = \max\{0,\; \gamma + [\mathrm{dist}^2_{txt}(f_q(Q), f(P_{txt})) + \mathrm{dist}^2_{nut}(g(Q), P_{nut})] - [\mathrm{dist}^2_{txt}(f_q(Q), f(N_{txt})) + \mathrm{dist}^2_{nut}(g(Q), N_{nut})]\} \quad (3)$$
All non-linear embedding functions $f_q(\cdot)$, $f(\cdot)$ and $g(\cdot)$ in this
equation are learned together via a deep learning model, which
will be discussed in the next section.
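A minimal sketch of Eq. 3 follows, with the squared-L2 text distance and the Eq. 1 nutrition distance redefined compactly so the snippet stands alone; g_Q is the query's learned nutrition-space embedding, assumed to be in the same [e, f, c, p] form as the candidate vectors, and the renormalization in m(...) is the same assumption noted after Eq. 1:

```python
import numpy as np

def dist2_txt(a, b):
    # Squared Euclidean distance on the text embedding space.
    return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def dist2_nut(n1, n2):
    # Eq. 1 written compactly; see the sketch after Eq. 1 for details.
    def m(n):
        s = np.array([9.0 * n[1], 4.0 * n[2], 4.0 * n[3]])
        return np.sqrt(s / s.sum())
    return float(np.arccos(np.clip(m(n1) @ m(n2), -1.0, 1.0)) ** 2
                 + np.log(n1[0] / n2[0]) ** 2)

def multimodal_triplet_loss(fq_Q, g_Q, f_Ptxt, P_nut, f_Ntxt, N_nut, gamma=1.0):
    """Eq. 3: per-candidate distances are summed across modalities,
    then pushed through the standard hinge."""
    d_pos = dist2_txt(fq_Q, f_Ptxt) + dist2_nut(g_Q, P_nut)
    d_neg = dist2_txt(fq_Q, f_Ntxt) + dist2_nut(g_Q, N_nut)
    return max(0.0, gamma + d_pos - d_neg)
```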
C. Deep Learning Model
Long Short-Term Memory (LSTM) is one of the most popular Recurrent Neural Network architectures and has been successfully applied to various text understanding problems [29]. Although convolution-based models have also been widely used to solve problems involving short-text data [8], [12], they are unable to capture various complexities associated with our short and noisy food data. In this paper, therefore, we adopt an LSTM-based deep neural network to better handle the nuances of word ordering in food names; one simple illustration of the importance of word order can be observed by comparing foods like "chocolate milk" versus "milk chocolate". Another distinguishing aspect of our model is that, unlike other works such as [8], it does not require any extensive word embedding (e.g., Word2Vec [30]) to begin with. Instead, we use simple 1-hot vectors to represent each word in the dictionary as inputs to our deep learning model; the word embeddings are learned as part of the training process. In the next section, we will show the overall effectiveness of this proposed LSTM-based model compared to a similar CNN-based architecture.
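Since the paper does not report layer sizes, the following present-day Keras sketch of the text tower uses illustrative dimensions (the original implementation used Keras with a Theano backend; feeding integer word indices to an Embedding layer is equivalent to multiplying 1-hot vectors by a learned matrix):

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 50_000  # assumed dictionary size (illustrative)
EMBED_DIM = 128      # learned word-embedding width (illustrative)
TEXT_DIM = 64        # m: size of the text embedding space (illustrative)

# Words arrive as integer indices; the Embedding layer is trained jointly
# with the rest of the network, so no pre-trained Word2Vec is needed.
tokens = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(tokens)
x = layers.LSTM(TEXT_DIM)(x)  # final hidden state acts as f(.) / f_q(.)
text_tower = Model(tokens, x)
text_tower.summary()
```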
IV. EXPERIMENTS

Table II. Pairwise ranking accuracy. All models trained on a set of 6.5M triplets.

Method | Accuracy (%)
Nutrition-Based LSTM | 73.04
Text-Based LSTM | 82.16
Multi-Modal CNN | 91.96
Multi-Modal LSTM with Concat. Vecs | 93.42
Multi-Modal LSTM | 94.48

In this experiment, we compare the following models:
• Multi-Modal CNN: A model similar to ours, but using
1D-convolution filters of width 3 instead of LSTM layers.
• Text-Based LSTM: An LSTM-based sub-model of our
full model in which only the text component is used. The
standard loss function in Eq. 2 with text vectors is applied.
• Nutrition-Based LSTM: An LSTM-based sub-model of
our full model in which only the nutrition content component
is used. The standard loss function in Eq. 2 with nutrition
content vectors is applied.
• Multi-Modal LSTM with Concatenated Vectors: A
model similar to ours, with the only exception
that the embedded text and nutrition vectors are simply
concatenated and the standard loss function in Eq. 2
is then applied.
• Multi-Modal LSTM: Our proposed method.
All models have been implemented using Keras (https://keras.io/)
with a Theano backend [31].
A. Training Data
Our training data consists of a set of triplets of the form <query, relevant candidate, irrelevant candidate>, in which the query refers to a food search text string, and the relevant (irrelevant) candidate refers to a food name candidate that is relevant (irrelevant) to the given query. This training data has been collected from randomly sampled food search logs produced by hundreds of millions of food search activities in MFP. Table III contains a summary of the training data collection. First, a set of food names that frequently appeared within the top 5 search results for the randomly selected food search queries was retrieved. Then, the Click-Through Ratio (CTR), $r(F|Q)$, for each food $F$ given a query $Q$ was computed. Next, each pair $(Q, F)$ was labeled positive if $r(F|Q) > 0.2$, or negative if $r(F|Q) < 0.05$. For each query $Q$, a set of triplets of the form $(Q, P, N)$ was generated, where $P$ ($N$) corresponds to positive (negative) foods. Last, the corresponding gram-normalized nutritional contents of all candidates were retrieved. As a result, a training set of 6.5M randomly selected triplets was produced from the food search logs.
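A schematic of this mining procedure is sketched below; the (query, food, clicked) log schema is a hypothetical stand-in for MFP's actual logs, while the CTR thresholds are the ones stated above:

```python
from collections import defaultdict
from itertools import product

def mine_triplets(log_rows, pos_thr=0.2, neg_thr=0.05):
    """log_rows: iterable of (query, food, clicked) impressions from search
    sessions (hypothetical schema). r(F|Q) > 0.2 labels (Q, F) positive,
    r(F|Q) < 0.05 negative, as in the paper."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, food, was_clicked in log_rows:
        shown[(query, food)] += 1
        clicked[(query, food)] += int(was_clicked)
    pos, neg = defaultdict(list), defaultdict(list)
    for (query, food), n_shown in shown.items():
        ctr = clicked[(query, food)] / n_shown  # r(F | Q)
        if ctr > pos_thr:
            pos[query].append(food)
        elif ctr < neg_thr:
            neg[query].append(food)
    # One triplet (Q, P, N) per positive/negative pairing for each query.
    return [(q, p, n) for q in pos for p, n in product(pos[q], neg[q])]
```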
B. Labeling of Triplets
Table III. Training data collection summary.

Description | Value
Number of sessions | ~500M
Number of unique queries | ~70M
Number of unique foods | ~10M
Number of collected triplets | ~6.5M

In this section, we compare the accuracy of the different models
on the task of triplet labeling. Our test set contains a set of
triplets, each consisting of a query string and two food candidates whose labels (positive/negative) are hidden for testing. As in a standard pairwise ranking test, each trained model is expected to assign a positive label to one of the candidates and a negative label to the other. To do so, the distance from the embedded query vector(s) to each of the candidates is computed, and the candidate with the smaller distance is labeled as positive. For the multi-modal networks, the distance function in the product space of text and nutrition is computed between the query and each of its candidates. We used $\mathrm{dist}_{txt}(x, y) = \sqrt{\|x - y\|^2}$, i.e., the Euclidean distance, and $\mathrm{dist}_{nut}$ from Eq. 1 in all experiments.
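Schematically, the test-time labeling rule looks as follows (a sketch; dist2_txt and dist2_nut as in the earlier snippets):

```python
def total_dist2(q_txt_emb, q_nut_emb, cand_txt_emb, cand_nut, dist2_txt, dist2_nut):
    # Product-space distance between the embedded query and one candidate.
    return dist2_txt(q_txt_emb, cand_txt_emb) + dist2_nut(q_nut_emb, cand_nut)

def label_pair(dist_to_a, dist_to_b):
    # The candidate closer to the query is predicted positive.
    return ("positive", "negative") if dist_to_a < dist_to_b else ("negative", "positive")
```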
A small subset of our training dataset, consisting of 65k randomly selected instances, was withheld as the test set. The
same test set was used to evaluate the accuracy of all models. In
Table II, we show the results of this experiment. As expected, the model that is based solely on nutritional content, i.e., Nutrition-Based LSTM, shows the poorest performance among all five methods (73.04%). This is not surprising, because nutrition information is not a unique identifier of foods in general: completely different food items might have quite similar nutritional content. Next, Text-Based LSTM reaches a better accuracy of 82.16%, but it still falls short of the more advanced multi-modal models. Once again, this is not surprising, because, as previously pointed out, learning semantic relations from our crowd-sourced DB of short food names using text information alone is not sufficient. For instance, "apple fuji" and "apple pie" might be treated as semantically close food objects by standard similarity metrics (e.g., edit distance), but their nutritional contents differ very significantly, and these two food objects are in fact very different. This clearly justifies leveraging nutritional information to build an effective ML model for our application.
Among the multi-modal approaches, Multi-Modal CNN does a relatively good job of combining text and nutrition data. However, it is unable to achieve the same level of accuracy as the LSTM-based models. This might be due to the fact that convolution-based models rely mainly on word adjacency, while our crowd-sourced food names require a more robust way to capture and represent the semantic relations between words. Finally, our proposed model, in which the geometric properties of the embedded text and nutrition vectors are preserved, is the clear winner, beating all other models including the Multi-Modal LSTM with concatenated vectors, where the embedded vectors are simply concatenated.
Table IV. Sample triplets labeled by three different models. The gap between positive and negative instances (where larger is better) is defined as g = dist(Q,N) − dist(Q,P). Negative gap values imply that the labels were incorrectly predicted. P_nut and N_nut indicate the nutrition vector values.

Query: Apple
  Positive candidate: Generic Fuji Apple, P_nut = [0.52, 0.01, 0.14, 0.01]
  Negative candidate: Apple Strudel, N_nut = [2.74, 0.11, 0.42, 0.03]
  Text-Based LSTM: dist(Q,P) = 0.657, dist(Q,N) = 0.057, gap = -0.600
  Multi-Modal CNN: dist(Q,P) = 0.800, dist(Q,N) = 1.004, gap = +0.204
  Multi-Modal LSTM: dist(Q,P) = 0.659, dist(Q,N) = 0.989, gap = +0.330

Query: Black Pepper
  Positive candidate: Spice Ground Black Pepper, P_nut = [2.17, 0, 0.43, 0]
  Negative candidate: Graze Black Pepper Pistachio, N_nut = [3.21, 0.32, 0.03, 0.10]
  Text-Based LSTM: dist(Q,P) = 0.607, dist(Q,N) = 0.988, gap = +0.381
  Multi-Modal CNN: dist(Q,P) = 0.941, dist(Q,N) = 0.939, gap = -0.002
  Multi-Modal LSTM: dist(Q,P) = 0.607, dist(Q,N) = 1.172, gap = +0.565

Table V. NDCG@10 score (%).

Method | "apple" | "black pepper" | "salt" | "white flour" | "pizza" | Average over 30 queries
Text-Based LSTM | 83.21 | 83.85 | 43.38 | 52.45 | 93.44 | 88.90
Multi-Modal CNN | 93.12 | 83.85 | 52.83 | 54.12 | 93.44 | 90.57
Multi-Modal LSTM | 100.0 | 90.60 | 58.31 | 56.92 | 94.24 | 92.72

In Table IV, we also show some sample triplets, namely
"Apple" and "Black Pepper", from the test set, along with the corresponding distances between the given query and each candidate, measured with respect to three different models: (1) Text-Based LSTM, (2) Multi-Modal CNN and (3) Multi-Modal LSTM (proposed). In the first example, "Apple", the Text-Based LSTM clearly failed to assign the correct labels to the input candidates. This is because the text-based distance between "apple" and "apple strudel" is much smaller than that between "apple" and "generic fuji apple". In contrast, the multi-modal models were more successful in predicting the labels, clearly showing the power of leveraging multiple modalities, and Multi-Modal LSTM shows the larger separation value (i.e., gap) between the given positive and negative candidates. The gap value here is defined as the difference between dist(Q,N) and dist(Q,P), where dist(·) is the corresponding distance function learned by each model; a larger gap value therefore indicates that the distance function is better at distinguishing the targets. In the second example, "Black Pepper", all labels were correctly assigned by the Text-Based LSTM, while Multi-Modal CNN failed to do so. Our Multi-Modal LSTM, on the other hand, not only predicted the correct labels, but also pushed the negative candidate almost 20% farther from the query (1.172 vs. 0.988 for the Text-Based LSTM). Both examples clearly illustrate the overall superiority of our proposed model.
C. Food Search Ranking
To evaluate the performance of a learning-to-rank model, it is common practice to test the given model in a real search application setting. In this section, we compare the performance of the following three models in a real-world food search ranking setting: (1) Text-Based LSTM, (2) Multi-Modal CNN and (3) Multi-Modal LSTM. For this experiment, we used the top 10 food search results for the 30 most popular queries. Each food name was assigned a label between 0 and 5, based on the Click-Through Ratio (CTR) observed in user search log events, with 0 being completely irrelevant and 5 being completely relevant. For every food corresponding to the given query, the embedded vectors from each model were computed, and the distance between the given query and the food candidate was measured. All items were ranked in ascending order with respect to their distance to the given query, and finally the Normalized Discounted Cumulative Gain (NDCG) score [32] was computed for each ranked set.
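A minimal sketch of the NDCG@10 computation is given below. It uses the common DCG form rel_i / log2(i + 1); the paper cites [32] but does not spell out the exact variant, so this is an assumption:

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for graded relevance labels (0-5 here) listed in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i + 1) for rank i
    dcg = float(np.sum(rel / discounts))
    idcg = float(np.sum(np.sort(rel)[::-1] / discounts))  # ideal ordering
    return dcg / idcg if idcg > 0 else 0.0

# Example: CTR-derived labels of the top-10 results for one query.
print(ndcg_at_k([5, 3, 4, 0, 2, 5, 1, 0, 0, 1]))
```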
Table V summarizes the NDCG scores for this experiment. Even for challenging queries like "salt" and "white flour," it is evident that our proposed Multi-Modal LSTM approach performs best among the three competing models across all cases. Furthermore, the rightmost column contains the average NDCG score computed over all 30 queries, which again shows the Multi-Modal LSTM model as the winner. This experiment further demonstrates that our proposed model works well in real-world food search applications.
V. CONCLUSION
In this paper, a deep learning model is proposed to address the pairwise multi-modal ranking problem. The main novelty of the proposed approach is the extension of the standard triplet hinge loss function to a multi-modal scenario. We applied LSTM models to effectively embed the text inputs into numerical feature vectors, and then supplemented these features with an additional modality of nutrition. All model parameters are trained at once, without extensive pre-processing steps to compute the word embeddings. Although the proposed model is designed to address the problem of food ranking with text and nutrition, it can be readily applied to other data types, with an arbitrary number of modalities.
REFERENCES
[1] L. Hang, "A short introduction to learning to rank," IEICE Transactions on Information and Systems, vol. 94, no. 10, pp. 1854–1862, 2011.
[2] H. Li, "Learning to rank for information retrieval and natural language processing," Synthesis Lectures on Human Language Technologies, vol. 7, no. 3, pp. 1–121, 2014.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[5] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[6] F. Zhao, Y. Huang, L. Wang, and T. Tan, "Deep semantic ranking based hashing for multi-label image retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
[7] X. Zhao, X. Li, and Z. Zhang, "Multimedia retrieval via deep learning to rank," IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1487–1491, 2015.
[8] A. Severyn and A. Moschitti, "Learning to rank short text pairs with convolutional deep neural networks," in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2015, pp. 373–382.
[9] Z. Lu and H. Li, "A deep architecture for matching short texts," in Advances in Neural Information Processing Systems, 2013, pp. 1367–1375.
[10] L. Rigutini, T. Papini, M. Maggini, and M. Bianchini, "A neural network approach for learning object ranking," in International Conference on Artificial Neural Networks. Springer, 2008, pp. 899–908.
[11] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, "Deep convolutional ranking for multilabel image annotation," arXiv preprint arXiv:1312.4894, 2013.
[12] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[13] Z. Cao, F. Wei, L. Dong, S. Li, and M. Zhou, "Ranking with recursive neural networks and its application to multi-document summarization," in AAAI, 2015, pp. 2153–2159.
[14] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.
[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," arXiv preprint arXiv:1411.2539, 2014.
[16] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[17] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Interspeech, 2012, pp. 194–197.
[18] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, "CNN-RNN: A unified framework for multi-label image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2285–2294.
[19] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., "DeViSE: A deep visual-semantic embedding model," in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[20] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[21] C. Lynch, K. Aryafar, and J. Attenberg, "Images don't lie: Transferring deep visual semantic features to large-scale multimodal learning to rank," arXiv preprint arXiv:1511.06746, 2015.
[22] P. D. Howell, L. D. Martin, H. Salehian, C. Lee, K. M. Eastman, and J. Kim, "Analyzing taste preferences from crowdsourced food entries," in Proceedings of the 6th International Conference on Digital Health Conference. ACM, 2016, pp. 131–140.
[23] A. Srivastava, I. Jermyn, and S. Joshi, "Riemannian analysis of probability density functions with applications in vision," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8.
[24] J. Lee, Riemannian Geometry: An Introduction to Curvature, ser. Graduate Texts in Mathematics, no. 176. Springer, 1997.
[25] M. Moakher, "A differential geometric approach to the geometric mean of symmetric positive-definite matrices," SIAM Journal on Matrix Analysis and Applications, vol. 26, no. 3, pp. 735–747, 2005.
[26] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[27] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," arXiv preprint arXiv:1411.2539, 2014.
[28] C. Lynch, K. Aryafar, and J. Attenberg, "Images don't lie: Transferring deep visual semantic features to large-scale multimodal learning to rank," arXiv preprint arXiv:1511.06746, 2015.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[30] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proceedings of Workshop at ICLR, 2013.
[31] F. Chollet, "Keras: Theano-based deep learning library," https://github.com/fchollet/keras, 2015.
[32] Y. Wang, L. Wang, Y. Li, D. He, and T.-Y. Liu, "A theoretical analysis of NDCG type ranking measures," in Conference on Learning Theory, 2013, pp. 25–54.