Representation Learning for Visual-Relational Knowledge Graphs
Daniel Oñoro-Rubio, NEC Labs Europe, Heidelberg, [email protected]
Mathias Niepert, NEC Labs Europe, Heidelberg, [email protected]
Alberto García-Durán, NEC Labs Europe, Heidelberg, [email protected]
Roberto González-Sánchez, NEC Labs Europe, Heidelberg, [email protected]
Roberto J. López-Sastre, University of Alcalá, Alcalá de Henares, [email protected]
ABSTRACT
A visual-relational knowledge graph (KG) is a multi-relational graph whose entities are associated with images. We introduce ImageGraph¹, a KG with 1,330 relation types, 14,870 entities, and 829,931 images. Visual-relational KGs lead to novel probabilistic query types where images are treated as first-class citizens. Both the prediction of relations between unseen images and multi-relational image retrieval can be formulated as query types in a visual-relational KG. We approach the problem of answering such queries with a novel combination of deep convolutional networks and models for learning knowledge graph embeddings. The resulting models can answer queries such as "How are these two unseen images related to each other?" We also explore a zero-shot learning scenario where an image of an entirely new entity is linked with multiple relations to entities of an existing KG. The multi-relational grounding of unseen entity images into a knowledge graph serves as the description of such an entity. We conduct experiments to demonstrate that the proposed deep architectures in combination with KG embedding objectives can answer the visual-relational queries efficiently and accurately.

ACM Reference Format:
Daniel Oñoro-Rubio, Mathias Niepert, Alberto García-Durán, Roberto González-Sánchez, and Roberto J. López-Sastre. 2018. Representation Learning for Visual-Relational Knowledge Graphs. In Proceedings of Deep Learning Day (KDD'2018 Workshop). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Several application domains can be modeled with knowledge graphs where entities are represented by nodes, object attributes by node attributes, and relationships between entities by directed edges between the nodes. For instance, a product recommendation system can be represented as a knowledge graph where nodes represent customers and products and where typed edges represent customer reviews and purchasing events. In the medical domain, there are several knowledge graphs that model diseases, symptoms, drugs, genes, and their interactions (cf. [2, 36, 51]). Increasingly, entities
1Project URL: https://github.com/nle-ml/mmkb.git.
[Figure 1: Visual-relational knowledge graph. The example shows entities such as Tokyo, Japan, Sensō-ji, the Gotoh Museum, and Murasaki Shikibu, connected by relations such as capitalOf, locatedIn, hasArtAbout, and bornIn, with images associated with the entities.]
in these knowledge graphs are associated with visual data. For instance, in the online retail domain, there are product and advertising images and in the medical domain, there are patient-associated imaging data sets (MRIs, CTs, and so on).
The ability of knowledge graphs to compactly represent a domain, its attributes, and relations makes them an important component of numerous AI systems. KGs facilitate the integration, organization, and retrieval of structured data and support various forms of reasoning. In recent years KGs have been playing an increasingly crucial role in fields such as question answering [6, 9], language modeling [1], and text generation [40]. Even though there is a large body of work on learning and reasoning in KGs, the setting of visual-relational KGs, where entities are associated with visual data, has not received much attention. A visual-relational KG represents entities, relations between these entities, and a large number of images associated with the entities (see Figure 1 for an example). While ImageNet [10] and the VisualGenome [23] datasets are based on KGs such as WordNet, they are predominantly used either as an object classification data set, as in the case of ImageNet, or to facilitate scene understanding in a single image. With ImageGraph, we propose the problem of reasoning about visual concepts across a large set of images organized in a knowledge graph.
Table 1: Statistics of the knowledge graphs used in this paper.
                    Entities |E|   Relations |R|   Triples (Train / Valid / Test)   Images (Train / Valid / Test)
ImageNet [10]             21,841              18   -                                14,197,122
VisualGenome [23]         75,729          40,480   1,531,448                        108,077
FB15k [7]                 14,951           1,345   483,142 / 50,000 / 59,071        0 / 0 / 0
ImageGraph                14,870           1,330   460,406 / 47,533 / 56,071        411,306 / 201,832 / 216,793
[Figure 2: Image samples for some entities of ImageGraph (e.g., Japan, Football, Michael Jackson, Madrid, The Simpsons, Drummer).]
[Figure 3: Set of query types. The four query types (1)–(4) are described in Section 4.1; the examples include a locatedIn query involving Japan and queries involving images of entirely new entities.]
The core idea is to treat images as first-class citizens both in the KG and in relational KG completion queries. In combination with the multi-relational structure of a KG, numerous more complex queries are possible. The main objective of our work is to understand to what extent visual data associated with entities of a KG can be used in conjunction with deep learning methods to answer visual-relational queries. Allowing images to be arguments of queries facilitates numerous novel query types. In Figure 3 we list some of the query types we address in this paper. In order to answer these queries, we build on both KG embedding methods and deep representation learning approaches for visual data. There has been a flurry of machine learning approaches tailored to specific problems such as link prediction in knowledge graphs. Examples are knowledge base factorization and embedding approaches [7, 18, 30, 32] and random-walk based ML models [14, 24]. We combine these approaches with deep neural networks to facilitate visual-relational query answering.
There are numerous application domains that could benefit from query answering in visual KGs. For instance, in online retail, visual representations of novel products could be leveraged for zero-shot product recommendations. Crucially, instead of only being able to retrieve similar products, a visual-relational KG would support the prediction of product attributes and, more specifically, what attributes customers might be interested in. For instance, in the fashion industry visual attributes are crucial for product recommendations [26, 41, 48, 49]. In general, we believe that being able to ground novel visual concepts into an existing KG with attributes and various relation types is a reasonable approach to zero-shot learning.
We make the following contributions. First, we introduce ImageGraph, a visual-relational KG with 1,330 relations where 829,931 images are associated with 14,870 different entities. Second, we introduce a new set of visual-relational query types. Third, we propose a novel set of neural architectures and objectives that we use for answering these novel query types. This is the first time that deep CNNs and KG embedding learning objectives are combined into a joint model. Fourth, we show that the proposed class of deep neural networks is also successful for zero-shot learning, that is, creating relations between entirely unseen entities and the KG using only visual data at query time.
2 RELATED WORK
We discuss the relation of our contributions to previous work with an emphasis on object detection, scene understanding, existing data sets, and zero-shot learning.
Relational and Visual Data. Answering queries in a visual-relational knowledge graph is our main objective. Previous work on combining relational and visual data has focused on object detection [13, 15, 25, 29, 37] and scene recognition [11, 34, 38, 45, 52], which are required for more complex visual-relational reasoning. Recent years have witnessed a surge in reasoning about human-object, object-object, and object-attribute relationships [8, 12, 13, 17, 19, 28, 54, 56].
[Figure 4: (a) Distribution of relation type frequencies on a logarithmic scale (y-axis: number of triples; x-axis: relation type index), with frequent relation types such as award_nominee, profession, disease, production_company, tv_actor, and ingredient annotated. (b) Fractions of symmetric (4%), asymmetric (88%), and other (8%) relation types.]
Table 2: Example triples for symmetric, asymmetric, and other relation types.
Relation type   Example (h, r, t)
Symmetric       (Emma Thompson, sibling, Sophie Thompson)
                (Sophie Thompson, sibling, Emma Thompson)
Asymmetric      (Non-profit organization, company_type, Apache Software Foundation)
                (Statistics, students_majoring, PhD)
Others          (Star Wars, film_series, Star Wars)
                (Star Wars Episode I: The Phantom Menace, film_series, Star Wars)
                (Star Wars Episode II: Attack of the Clones, film_series, Star Wars)
The VisualGenome project [23] is a knowledge base that integrates language and vision modalities. The project provides a knowledge graph, based on WordNet, which provides annotations of categories, attributes, and relation types for each image. Recent work has used the dataset to focus on scene understanding in single images. For instance, Lu et al. [27] proposed a model to detect relation types between objects depicted in an image by inferring sentences such as "man riding bicycle." Veit et al. [49] propose a siamese CNN to learn a metric representation on pairs of textile products so as to learn which products have similar styles. There is a large body of work on metric learning where the objective is to generate image embeddings such that a pairwise distance-based loss is minimized [4, 33, 39, 43, 50]. Recent work has extended this idea to directly optimize a clustering quality metric [44]. Zhou et al. propose a method based on a bipartite graph that links depictions of meals to their ingredients. Johnson et al. [21] propose to use the VisualGenome data to recover images from text queries. ImageGraph is different from these data sets in that the relation types hold between different images and image-annotated entities. This defines a novel class of problems where one seeks to answer queries such as "How are these two images related?" With this work, we address problems ranging from predicting the relation types for image pairs to multi-relational image retrieval.
Zero-shot Learning. We focus on exploring ways in which KGs can be used to find relationships between unseen images, that is, images depicting novel entities that are not part of the KG, and visual depictions of known KG entities. This is a form of zero-shot learning (ZSL) where the objective is to generalize to novel visual concepts without seeing any training examples. Generally, ZSL methods (e.g., [35, 55]) rely on an underlying embedding space, such as attributes, in order to recognize the unseen categories. However, in this paper, we do not assume the availability of such a common embedding space but we assume the existence of an external visual-relational KG. When this explicit knowledge is not encoded in the underlying embedding space, other works rely on finding the similarities through the linguistic space (e.g., [3, 27]), leveraging distributional word representations so as to capture a notion of taxonomy and similarity. But these works address scene understanding in a single image, i.e., these models are able to detect the visual relationships in one given image. On the contrary, our models are able to find relationships between different images and entities.
3 IMAGEGRAPH: A VISUAL-RELATIONAL KNOWLEDGE GRAPH
ImageGraph is a visual-relational KG whose relational structure is based on that of Freebase [5]. More specifically, it is based on FB15k, a subset of Freebase, which has been used as a benchmark data set [30]. Since FB15k does not include visual data, we perform the following steps to enrich the KG entities with image data. We implemented a web crawler that is able to parse query results for the image search engines Google Images, Bing Images, and Yahoo Image Search. To minimize the amount of noise due to polysemous entity labels (for example, there are more than 100 Freebase entities with the text label "Springfield"), we extracted, for each
[Figure 5: (a) The 10 most frequent entity types (y-axis: entity count). (b) The entity distribution (y-axis: triple count; x-axis: entity index), with example entities such as United States of America, English Language, Executive Producer, University of Oxford, Parlophone, and Vigor Shipyards.]
entity in FB15k, all Wikipedia URIs from the 1.9 billion triple Freebase RDF dump. For instance, for Springfield, Massachusetts, we obtained such URIs as Springfield_(Massachusetts,United_States) and Springfield_(MA). These URIs were processed and used as search queries for disambiguation purposes. We used the crawler to download more than 2.4M images (more than 462 GB of data). We removed corrupted, low-quality, and duplicate images, and we used the 25 top images returned by each of the image search engines whenever there were more than 25 results. The images were scaled to have a maximum height or width of 500 pixels while maintaining their aspect ratio. This resulted in 829,931 images associated with 14,870 different entities (55.8 images per entity). After filtering out triples where either the head or tail entity could not be associated with an image, the visual KG consists of 564,010 triples expressing 1,330 different relation types between 14,870 entities. We provide three sets of triples for training, validation, and testing plus three image splits, also for training, validation, and testing. Table 1 lists the statistics of the resulting visual KG. Any KG derived from FB15k such as FB15k-237 [46] can also be associated with the crawled images. Since providing the images themselves would violate copyright law, we provide the code for the distributed crawler and the list of image URLs crawled for the experiments in this paper2.
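The rescaling step described above corresponds to a thumbnail operation; the following is a minimal sketch using Pillow (the file paths are hypothetical, and this is not the released crawler code):

```python
from PIL import Image

def rescale_image(src_path, dst_path, max_side=500):
    """Rescale an image so its longer side is at most `max_side` pixels,
    preserving the aspect ratio (as in the ImageGraph preprocessing)."""
    img = Image.open(src_path)
    # thumbnail() resizes in place and never enlarges the image.
    img.thumbnail((max_side, max_side))
    img.save(dst_path)

# Hypothetical usage:
# rescale_image("raw/senso_ji_001.jpg", "processed/senso_ji_001.jpg")
```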
The distribution of relation types is depicted in Figure 4a. It shows, on a logarithmic scale, the number of times each relation occurs in the KG. We observe that relationships like award_nominee or profession occur quite frequently while others such as ingredient occur just a few times. 4% of the relation types are symmetric, 88% are asymmetric, and 8% are others (see Figure 4b). A symmetric relation implies (h, r, t) ⇒ (t, r, h), an asymmetric relation satisfies (h, r, t) ⇒ (t, ¬r, h), and under others we group those relations that are neither symmetric nor asymmetric. Table 2 shows some qualitative examples of the relation types. There are 585 distinct entity types such as Person, Athlete, and City.
2ImageGraph crawler and URLs: https://github.com/robegs/imageDownloader.
In Figure 5a we plot the most frequent entity types. In Figure 5b we plot the entity frequencies and some example entities.
Table 1 lists the statistics of ImageGraph and related data sets. First, we would like to remark on the differences between ImageGraph and the Visual Genome data (VGD) [23]. With ImageGraph we address the problem of learning a representation that directly maps the relations between images, without the involvement of text. On a high level, we focus on producing rankings as responses to probabilistic queries. In some sense, this is similar to information retrieval except that in our proposed work, images are first-class citizens. In contrast, VGD is focused on modeling relations between objects in images, and the relations are expressed in natural language. Second, the main differences between ImageGraph and ImageNet are the following. ImageNet is based on WordNet, a lexical database where synonymous words from the same lexical category are grouped into synsets. There are 18 relations expressing connections between synsets. In Freebase, on the other hand, there are two orders of magnitude more relations. In FB15k, the subset we focus on, there are 1,345 relations expressing locations of places, positions of basketball players, gender of entities, and so on. Moreover, entities in ImageNet exclusively represent entity types such as Cats and Cars, whereas entities in FB15k are either entity types or instances of entity types such as Albert Einstein and Paris. This renders the computer vision problems associated with ImageGraph more challenging than those for existing datasets. Moreover, with ImageGraph the focus is on learning relational ML models that incorporate visual data both during learning and at query time.
4 REPRESENTATION LEARNING FOR VISUAL-RELATIONAL GRAPHS
A knowledge graph (KG) K is given by a set of triples T, that is, statements of the form (h, r, t), where h, t ∈ E are the head and tail entities, respectively, and r ∈ R is a relation type. Figure 1 depicts a small fragment of a KG with relations between entities and images associated with the entities. Prior work has not included image data and has, therefore, focused on the following two types of queries.
First, the query type (h, r?, t) asks for the relations between a given pair of head and tail entities. Second, the query types (h, r, t?) and (h?, r, t) ask for entities correctly completing the triple. The latter query type is often referred to as knowledge base completion. Here, we focus on queries that involve visual data as query objects, that is, objects that are either contained in the queries, the answers to the queries, or both.
4.1 Visual-Relational Query Answering
When entities are associated with image data, several completely novel query types are possible. Figure 3 lists the query types we focus on in this paper. We refer to images used during training as seen and to all other images as unseen.
(1) Given a pair of unseen images for which we do not know their KG entities, determine the unknown relations between the underlying entities.
(2) Given an unseen image, for which we do not know the underlying KG entity, and a relation type, determine the seen images that complete the query.
(3) Given an unseen image of an entirely new entity that is not part of the KG, and an unseen image for which we do not know the underlying KG entity, determine the unknown relations between the two underlying entities.
(4) Given an unseen image of an entirely new entity that is not part of the KG, and a known KG entity, determine the unknown relations between the two entities.
For each of these query types, the sought-after relations between the underlying entities have never been observed during training. Query types (3) and (4) are a form of zero-shot learning since neither the new entity's relationships with other entities nor its images have been observed during training. These considerations illustrate the novel nature of the visual query types. The machine learning models have to learn the relational semantics of the KG and not simply a classifier that assigns images to entities. These query types are also motivated by the fact that for typical KGs the number of entities is orders of magnitude greater than the number of relations.
4.2 Deep Representation Learning for Query Answering
We first discuss the state of the art of KG embedding methods and translate the concepts to query answering in visual-relational KGs. Let raw_i be the raw feature representation for entity i ∈ E and let f and g be differentiable functions. Most KG completion methods learn an embedding of the entities in a vector space via some scoring function that is trained to assign high scores to correct triples and low scores to incorrect triples. Scoring functions often have the form f_r(e_h, e_t), where r is a relation, e_h and e_t are d-dimensional vectors (the embeddings of the head and tail entities, respectively), and e_i = g(raw_i) is an embedding function that maps the raw input representation of an entity to the embedding space. In the case of KGs without additional visual data, the raw representation of an entity is simply its one-hot encoding.
Existing KG completion methods use the embedding function g(raw_i) = raw_i^⊺ W, where W is a |E| × d matrix, and differ only in their scoring function, that is, in the way that the embedding vectors of the head and tail entities are combined with the parameter vector φ_r:

• Difference (TransE [7]): f_r(e_h, e_t) = −||e_h + φ_r − e_t||_2, where φ_r is a d-dimensional vector;
• Multiplication (DistMult [53]): f_r(e_h, e_t) = (e_h ∗ e_t) · φ_r, where ∗ is the element-wise product and φ_r is a d-dimensional vector;
• Circular correlation (HolE [31]): f_r(e_h, e_t) = (e_h ⋆ e_t) · φ_r, where [a ⋆ b]_k = Σ_{i=0}^{d−1} a_i b_{(i+k) mod d} and φ_r is a d-dimensional vector; and
• Concatenation: f_r(e_h, e_t) = (e_h ⊙ e_t) · φ_r, where ⊙ is the concatenation operator and φ_r is a 2d-dimensional vector.
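For concreteness, the following NumPy sketch (our own illustration; variable names are not from the paper) computes the four scoring functions above for given head and tail embeddings e_h, e_t and a relation parameter vector phi_r.

```python
import numpy as np

def score_transe(e_h, e_t, phi_r):
    # Difference (TransE): negative L2 distance between e_h + phi_r and e_t.
    return -np.linalg.norm(e_h + phi_r - e_t)

def score_distmult(e_h, e_t, phi_r):
    # Multiplication (DistMult): element-wise product, then dot product with phi_r.
    return float(np.dot(e_h * e_t, phi_r))

def score_hole(e_h, e_t, phi_r):
    # Circular correlation (HolE): [a * b]_k = sum_i a_i * b_{(i+k) mod d},
    # computed efficiently in the Fourier domain.
    corr = np.real(np.fft.ifft(np.conj(np.fft.fft(e_h)) * np.fft.fft(e_t)))
    return float(np.dot(corr, phi_r))

def score_concat(e_h, e_t, phi_r):
    # Concatenation: phi_r is 2d-dimensional for this composition.
    return float(np.dot(np.concatenate([e_h, e_t]), phi_r))
```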
For each of these instances, the matrix W (storing the entity embeddings) and the vectors φ_r are learned during training. In general, the parameters are trained such that f_r(e_h, e_t) is high for true triples and low for triples assumed not to hold in the KG. The training objective is often based on the logistic loss, which has been shown to be superior for most of the composition functions [47]:
\min_{\Theta} \sum_{(h,r,t) \in T_{\mathrm{pos}}} \log\left(1 + \exp(-f_r(\mathbf{e}_h, \mathbf{e}_t))\right) + \sum_{(h,r,t) \in T_{\mathrm{neg}}} \log\left(1 + \exp(f_r(\mathbf{e}_h, \mathbf{e}_t))\right) + \lambda \|\Theta\|_2^2, \qquad (1)
where T_pos and T_neg are the sets of positive and negative training triples, respectively, Θ are the parameters trained during learning, and λ is a regularization hyperparameter. For the above objective, a process for creating corrupted triples T_neg is required. This often involves sampling a random entity for either the head or the tail entity. To answer queries of the types (h, r, t?) and (h?, r, t) after training, we form all possible completions of the queries and compute a ranking based on the scores assigned by the trained model to these completions.
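As an illustration of the corrupted-triple sampling and the logistic loss of Equation (1), the following sketch uses the DistMult scoring function; the data layout (embedding matrices E and Phi) and all names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(triple, num_entities):
    """Create a corrupted (negative) triple by replacing the head or the tail
    with a uniformly sampled random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (int(rng.integers(num_entities)), r, t)
    return (h, r, int(rng.integers(num_entities)))

def logistic_loss(pos_triples, neg_triples, E, Phi, lam=1e-3):
    """Equation (1) with the DistMult score f_r(e_h, e_t) = (e_h * e_t) . phi_r.
    E: num_entities x d entity embeddings, Phi: num_relations x d relation vectors."""
    def f(h, r, t):
        return float(np.dot(E[h] * E[t], Phi[r]))
    loss = sum(np.log1p(np.exp(-f(h, r, t))) for h, r, t in pos_triples)
    loss += sum(np.log1p(np.exp(f(h, r, t))) for h, r, t in neg_triples)
    return loss + lam * (np.sum(E ** 2) + np.sum(Phi ** 2))

# Hypothetical usage: one corrupted triple per positive triple.
# neg_triples = [corrupt(tr, E.shape[0]) for tr in pos_triples]
```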
For queries of type (h, r?, t) one typically uses the softmax activation in conjunction with the categorical cross-entropy loss, which does not require negative triples:
\min_{\Theta} \sum_{(h,r,t) \in T_{\mathrm{pos}}} -\log\left(\frac{\exp(f_r(\mathbf{e}_h, \mathbf{e}_t))}{\sum_{r' \in R} \exp(f_{r'}(\mathbf{e}_h, \mathbf{e}_t))}\right) + \lambda \|\Theta\|_2^2, \qquad (2)
where Θ are the parameters trained during learning.

For visual-relational KGs, the input consists of raw image data instead of the one-hot encodings of entities. The approach we propose builds on the ideas and methods developed for KG completion. Instead of having a simple embedding function g that multiplies the input with a weight matrix, however, we use deep convolutional neural networks to extract meaningful visual features from the input images. For the composition function f we evaluate the four operations that were used in the KG completion literature: difference, multiplication, concatenation, and circular correlation. Figure 6a depicts the basic architecture we trained for query answering. The weights of the parts of the neural network responsible for embedding the raw image input, denoted by g, are tied. We also experimented with additional hidden layers, indicated by the dashed dense layer. The composition operation op is either difference, multiplication, concatenation, or circular correlation.
[Figure 6: (a) Proposed architecture for query answering. (b) Illustration of two possible approaches to visual-relational query answering. One can predict relation types between two images directly (green arrow; our approach) or combine an entity classifier with a KB embedding model for relation prediction (red arrows; baseline VGG16+DistMult).]
To the best of our knowledge, this is the first time that KG embedding learning and deep CNNs have been combined for visual-relational query answering.
5 EXPERIMENTS
We conduct a series of experiments to evaluate our proposed approach to visual-relational query answering. First, we describe the experimental set-up that applies to all experiments. Second, we report and interpret results for the different types of visual-relational queries.
5.1 General Set-up
We used Caffe [20], a deep learning framework, for designing, training, and evaluating the proposed models. The embedding function g is based on the VGG16 model introduced in [42]. We pre-trained the VGG16 on the ILSVRC2012 data set derived from ImageNet [10] and removed the softmax layer of the original VGG16. We added a 256-dimensional layer after the last dense layer of the VGG16. The output of this layer serves as the embedding of the input images. The reduction of the embedding dimensionality from 4096 to 256 is motivated by the objective of obtaining an efficient and compact latent representation that is feasible for KGs with billions of entities. For the composition function f, we used one of the four operations difference, multiplication, concatenation, or circular correlation. We also experimented with an additional hidden layer with ReLU activation. Figure 6a depicts the generic network architecture. The output layer of the architecture has a softmax or sigmoid activation with cross-entropy loss. We initialized the weights of the newly added layers with the Xavier method [16].
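The following is a minimal PyTorch sketch of this architecture; the paper's implementation uses Caffe, so the class and variable names here are our own, and the difference variant below simply feeds e_h − e_t to the output layer rather than using the TransE-style score.

```python
import torch
import torch.nn as nn
from torchvision import models

class VisualRelationNet(nn.Module):
    """Sketch of the architecture in Figure 6a (our reimplementation, not the
    paper's Caffe code): two images are embedded by a shared VGG16, projected
    to 256 dimensions, combined, and scored against all relation types."""

    def __init__(self, num_relations=1330, emb_dim=256, op="cat"):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")   # ILSVRC2012-pretrained (torchvision >= 0.13)
        # Drop the original 1000-way output layer; keep the 4096-d penultimate features.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.backbone = vgg                           # tied weights: the same network embeds head and tail
        self.embed = nn.Linear(4096, emb_dim)         # newly added 256-dimensional embedding layer
        self.op = op
        self.out = nn.Linear(2 * emb_dim if op == "cat" else emb_dim, num_relations)

    def forward(self, img_h, img_t):
        e_h = self.embed(self.backbone(img_h))
        e_t = self.embed(self.backbone(img_t))
        if self.op == "diff":        # simplification of the TransE-style scoring
            z = e_h - e_t
        elif self.op == "mult":      # DistMult-style element-wise product
            z = e_h * e_t
        else:                        # concatenation (CAT)
            z = torch.cat([e_h, e_t], dim=1)
        return self.out(z)           # relation scores; softmax/sigmoid is applied by the loss
```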
We used a batch size of 45, which was the maximum possible given GPU memory. To create the training batches, we sample triples uniformly at random from the training triples. For each sampled triple, we randomly sample one image for the head and one for the tail from the set of training images. We applied SGD with a learning rate of 10⁻⁵ for the parameters of the VGG16 and a learning rate of 10⁻³ for the remaining parameters. It is crucial to use two different learning rates since the large gradients in the newly added layers would otherwise lead to unreasonable changes in the pretrained part of the network. We set the weight decay to 5×10⁻⁴. We reduced the learning rate by a factor of 0.1 every 40,000 iterations. Each of the models was trained for 100,000 iterations.
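A sketch of this optimization schedule, again in PyTorch rather than Caffe and building on the VisualRelationNet sketch above: the pretrained VGG16 parameters use a learning rate of 10⁻⁵, the newly added layers 10⁻³, with weight decay 5×10⁻⁴ and a step decay of 0.1 every 40,000 iterations.

```python
import torch

# Builds on the VisualRelationNet sketch above (our assumption, not the paper's Caffe setup).
model = VisualRelationNet(op="cat")
optimizer = torch.optim.SGD(
    [
        {"params": model.backbone.parameters(), "lr": 1e-5},  # pretrained VGG16 layers
        {"params": list(model.embed.parameters()) + list(model.out.parameters())},  # new layers
    ],
    lr=1e-3,                 # default learning rate, used by the newly added layers
    weight_decay=5e-4,
)
# Decay both learning rates by a factor of 0.1 every 40,000 iterations
# (call scheduler.step() once per training iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40_000, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()  # softmax + categorical cross-entropy over relation types
```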
Table 3: Results for the relation prediction problem.
Model            Median   Hits@1   Hits@10   MRR
VGG16+DistMult       94      6.0      11.4   0.087
Prob. Baseline       35      3.7      26.5   0.104
DIFF                 11     21.1      50.0   0.307
MULT                  8     15.5      54.3   0.282
CAT                   6     26.7      61.0   0.378
DIFF+1HL              8     22.6      55.7   0.333
MULT+1HL              9     14.8      53.4   0.273
CAT+1HL               6     25.3      60.0   0.365
Since the answers to all query types are either rankings of images or rankings of relations, we utilize metrics measuring the quality of rankings. In particular, we report results for hits@1 (hits@10, hits@100), measuring the percentage of times the correct relation was ranked highest (ranked in the top 10, top 100). We also compute the median of the ranks of the correct entities or relations and the Mean Reciprocal Rank (MRR) for entity and relation rankings, respectively, defined as follows:

\mathrm{MRR} = \frac{1}{2|T|} \sum_{(h,r,t) \in T} \left( \frac{1}{\mathrm{rank}_{\mathrm{img}}(h)} + \frac{1}{\mathrm{rank}_{\mathrm{img}}(t)} \right) \qquad (3)

\mathrm{MRR} = \frac{1}{|T|} \sum_{(h,r,t) \in T} \frac{1}{\mathrm{rank}_r}, \qquad (4)
where T is the set of all test triples, rank_r is the rank of the correct relation, and rank_img(h) is the rank of the highest-ranked image of entity h. For each query, we remove all triples that are also correct answers to the query from the ranking. All experiments were run on commodity hardware with 128 GB RAM, a single 2.8 GHz CPU, and an NVIDIA 1080 Ti.
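Given a list of (filtered) ranks, the reported metrics can be computed as in this short sketch (our own helper, not from the paper):

```python
import numpy as np

def ranking_metrics(ranks):
    """Compute median rank, hits@k (in percent), and MRR from 1-based ranks.
    For relation ranking, each rank is the filtered rank of the correct relation
    (Equation (4)); for image retrieval, Equation (3) averages the reciprocal
    ranks of the highest-ranked correct head and tail images instead."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "median": float(np.median(ranks)),
        "hits@1": float(np.mean(ranks <= 1) * 100.0),
        "hits@10": float(np.mean(ranks <= 10) * 100.0),
        "hits@100": float(np.mean(ranks <= 100) * 100.0),
        "mrr": float(np.mean(1.0 / ranks)),
    }
```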
5.2 Visual Relation Prediction
Given a pair of unseen images, we want to determine the relations between their underlying unknown entities. This can be expressed as (img_h, r?, img_t). Figure 3(1) illustrates this query type, which we refer to as visual relation prediction. We train the deep architectures using the training and validation triples and images, respectively. For each triple (h, r, t) in the training data set, we sample
one training image uniformly at random for both the head and the tail entity. We use the architecture depicted in Figure 6a with the softmax activation and the categorical cross-entropy loss. For each test triple, we sample one image uniformly at random from the test images of the head and tail entity, respectively. We then use the pair of images to query the trained deep neural networks. To get a more robust statistical estimate of the evaluation measures, we repeat the above process three times per test triple. Again, none of the test triples and images are seen during training, nor are any of the training images used during testing. Computing the answer to one query takes the model 20 ms.
We compare the proposed architectures to two different baselines: one based on entity classification followed by a KB embedding method for relation prediction (VGG16+DistMult), and a probabilistic baseline (Prob. Baseline). The entity classification baseline consists of fine-tuning a pretrained VGG16 to classify images into the 14,870 entities of ImageGraph. To obtain the relation type ranking at test time, we predict the entities for the head and the tail using the VGG16 and then use the KB embedding method DistMult [53] to return a ranking of relation types for the given (head, tail) pair. DistMult is a KB embedding method that achieves state-of-the-art results for KB completion on FB15k [22]. For this experiment we therefore just substitute the original output layer of the VGG16 pretrained on ImageNet with a new output layer suitable for our problem. To train, we join the training and validation splits, set the learning rate to 10⁻⁵ for all layers, and train following the same strategy that we use in all of our experiments. Once the system is trained, we test the model by classifying the entities of the images in the test set. To train DistMult, we sample 500 negative triples for each positive triple and use an embedding size of 100. Figure 6b illustrates the VGG16+DistMult baseline and contrasts it with our proposed approach. The second baseline (probabilistic baseline) computes the probability of each relation type using the set of training and validation triples. The baseline ranks relation types based on these prior probabilities.
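The relation-ranking step of the VGG16+DistMult baseline can be sketched as follows: the entity classifier's argmax predictions provide head_id and tail_id, and DistMult scores every relation type for that pair (the embedding matrices and names are illustrative assumptions).

```python
import numpy as np

def rank_relations_distmult(head_id, tail_id, E, Phi):
    """Rank all relation types for a (head, tail) entity pair with DistMult.
    E: num_entities x d entity embeddings, Phi: num_relations x d relation vectors."""
    scores = Phi @ (E[head_id] * E[tail_id])   # one score per relation type
    return np.argsort(-scores)                 # relation indices, best first

# Hypothetical usage: head_id and tail_id come from the VGG16 entity
# classifier's argmax predictions for the two query images.
```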
Table 3 lists the results for the two baselines and the different proposed architectures. The probabilistic baseline outperforms the VGG16+DistMult baseline in 3 of the metrics. This is due to the highly skewed distribution of relation types in the training, validation, and test triples: a small number of relation types makes up a large fraction of triples. Figures 4a and 5b plot the counts of relation types and entities. Moreover, despite DistMult achieving a hits@1 value of 0.46 for the relation prediction problem between entity pairs, the baseline VGG16+DistMult performs poorly. This is due to the poor entity classification performance of the VGG16 (accuracy: 0.082, F1: 0.068). In the remainder of the experiments, therefore, we only compare to the probabilistic baseline. The lower part of Table 3 lists the results for the proposed architectures. DIFF, MULT, and CAT stand for the different possible composition operations. We omitted the composition operation circular correlation since we were not able to make the corresponding model converge, despite trying several different optimizers and hyperparameter settings. The postfix 1HL stands for architectures where we added an additional hidden layer with ReLU activation before the softmax. The concatenation operation clearly outperforms the multiplication and difference operations.
Table 4: Results for the multi-relational image retrieval problem.

                Median          Hits@100        MRR
Model           Head    Tail    Head    Tail    Head    Tail
Baseline        6504    2789    11.9    18.4    0.065   0.115
DIFF            1301     877    19.6    26.3    0.051   0.094
MULT            1676    1136    16.8    22.9    0.040   0.080
CAT             1022     727    21.4    27.5    0.050   0.087
DIFF+1HL        1644    1141    15.9    21.9    0.045   0.085
MULT+1HL        2004    1397    14.6    20.5    0.034   0.069
CAT+1HL         1323     919    17.8    23.6    0.042   0.080
CAT-SIG          814     540    23.2    30.1    0.049   0.082
This is contrary to findings in the KG completion literature, where MULT and DIFF outperformed the concatenation operation. The models with the additional hidden layer did not perform better than their shallower counterparts, with the exception of the DIFF model. We hypothesize that this is because difference is the only linear composition operation and therefore benefits from an additional non-linearity. Each of the proposed models outperforms the baselines.
5.3 Multi-Relational Image Retrieval
Given an unseen image, for which we do not know the underlying KG entity, and a relation type, we want to retrieve existing images that complete the query. If the image for the head entity is given, we return a ranking of images for the tail entity; if the tail entity image is given, we return a ranking of images for the head entity. This problem corresponds to query type (2) in Figure 3. Note that this is equivalent to performing multi-relational metric learning which, to the best of our knowledge, has not been done before. We performed experiments with each of the three composition functions f and for two different activation/loss functions. First, we used the models trained with the softmax activation and the categorical cross-entropy loss to rank images. Second, we took the models trained with the softmax activation and substituted the softmax activation with a sigmoid activation and the corresponding binary cross-entropy loss. For each training triple (h, r, t) we then created two negative triples by sampling once the head and once the tail entity from the set of entities. The negative triples are then used in conjunction with the binary cross-entropy loss of Equation (1) to refine the pretrained weights. Directly training a model with the binary cross-entropy loss was not possible since the model did not converge properly. Pretraining with the softmax activation and categorical cross-entropy loss was crucial to make the binary loss work.
During testing, we used the test triples and ranked the images based on the probabilities returned by the respective models. For instance, given the query (img_Sensō-ji, locatedIn, img_t?), we substituted img_t? with all training and validation images, one at a time, and ranked the images according to the probabilities returned by the models. We use the rank of the highest ranked image belonging to the true entity (here: Japan) to compute the values for the evaluation measures. We repeat the same experiment three times (each time randomly sampling the images) and report average values.
[Figure 7: Example queries and results for the multi-relational image retrieval problem (relation types shown include succeededBy, genreOf, wonAward, and directedBy).]
Table 5: Results for the zero-shot learning experiments.

                      Median       Hits@1        Hits@10       MRR
                      H      T     H      T      H      T      H       T
Zero-Shot Query (3)
Base                  34     31    1.9    2.3    18.2   28.7   0.074   0.089
CAT                    8      7   19.1   22.4    54.2   57.9   0.306   0.342
Zero-Shot Query (4)
Base                   9      5   13.0   22.6    52.3   64.8   0.251   0.359
CAT                    5      3   26.9   33.7    62.5   70.4   0.388   0.461
Again, we compare the results for the different architectures with a probabilistic baseline. For this baseline, however, we compute a distribution of head and tail entities for each of the relation types. For example, for the relation type locatedIn we compute two distributions, one for head and one for tail entities. We used the same measures as in the previous experiment to evaluate the returned image rankings.
Table 4 lists the results of the experiments. As for relation prediction, the best performing models are based on the concatenation operation, followed by the difference and multiplication operations. The architectures with an additional hidden layer do not improve the performance. We also provide the results for the concatenation-based model with softmax activation where we refined the weights using a sigmoid activation and negative sampling as described before. This model is the best performing model. All neural network models are significantly better than the baseline with respect to the median and hits@100. However, the baseline has slightly superior results for the MRR. This is due to the skewed distribution of entities and relations in the KG (see Figure 5b and Figure 4a). This shows once more that the baseline is highly competitive for the given KG. Figure 7 visualizes the answers the CAT-SIG model provided for a set of four example queries. For the two queries on the left, the model performed well and ranked the correct entity in the top 3 (green frame). The examples on the right illustrate queries for which the model returned an inaccurate ranking. To perform query answering in a highly efficient manner, we precomputed and stored all image embeddings once, and only compute the scoring function (involving the composition operation and a dot product with φ_r) at query time. Answering one multi-relational image retrieval query (which would otherwise require 613,138 individual queries, one per possible image) took only 90 ms.
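The efficient retrieval procedure just described can be sketched as follows for the concatenation-based model: all candidate image embeddings are precomputed once, and at query time only the composition and the dot product with the relation parameters φ_r are evaluated (array names are our own).

```python
import numpy as np

def retrieve_tails(e_head, r, candidate_embs, Phi):
    """Rank candidate tail images for the query (img_head, r, ?).
    e_head: (d,) embedding of the query image; candidate_embs: (N, d) precomputed
    embeddings of all candidate images; Phi: (num_relations, 2d) parameters of the
    concatenation-based scoring function."""
    n = candidate_embs.shape[0]
    pairs = np.concatenate([np.tile(e_head, (n, 1)), candidate_embs], axis=1)  # (N, 2d)
    scores = pairs @ Phi[r]        # f_r(e_head, e_cand) for every candidate image
    return np.argsort(-scores)     # candidate indices, best first
```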
[Figure 8: Example results for zero-shot learning. For each pair of images the top three relation types (as ranked by the CAT model) are listed. For the pair of images at the top, the first relation type is correct. For the pair of images at the bottom, the correct relation type TaxonomyHasEntry is not among the top three relation types.]
5.4 Zero-Shot Visual Relation Prediction
The last set of experiments addresses the problem of zero-shot learning via visual relation prediction. For both query types, we are given a new image of an entirely new entity that is not part of the KG. The first query type asks for relations between the given image and an unseen image for which we do not know the underlying KG entity. The second query type asks for the relations between the given image and an existing KG entity. We believe that creating multi-relational links to existing KG entities is a reasonable approach to zero-shot learning since an unseen entity or category is integrated into an existing KG. The relations to existing visual concepts and their attributes provide a characterization of the new entity/category. This problem cannot be addressed with KG embedding methods since entities need to be part of the KG during training for these models to work.
For the zero-shot experiments, we generated a new set of training, validation, and test triples. We randomly sampled 500 entities that occur as head (tail) in the set of test triples. We then removed all training and validation triples whose head or tail is one of these 1000 entities. Finally, we only kept those test triples with one of the 1000 entities either as head or tail, but not both. For query type (4), where we know the target entity, we sample 10 of its images and use the models 10 times to compute a probability. We use the average probabilities to rank the relations. For query type (3) we
only use one image sampled randomly. As with previous experiments, we repeated this procedure three times and averaged the results. For the baseline, we compute the probabilities of relations in the training and validation set (for query type (3)) and the probabilities of relations conditioned on the target entity (for query type (4)). Again, these are very competitive baselines due to the skewed distribution of relations and entities. Table 5 lists the results of the experiments. The model based on the concatenation operation (CAT) outperforms the baseline and performs surprisingly well. The deep models are able to generalize to unseen images since their performance is comparable to the performance on the relation prediction task (query type (1)), where the entity was part of the KG during training (see Table 3). Figure 8 depicts example queries for the zero-shot query type (3). For the first query example, the CAT model ranked the correct relation type first (indicated by the green bounding box). The second example is more challenging and the correct relation type was not part of the top 10 ranked relation types.
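For completeness, a sketch of the zero-shot split construction described at the beginning of this subsection (our own reimplementation under the stated assumptions; how overlaps between sampled head and tail entities are handled is not specified in the paper).

```python
import random

def zero_shot_split(train, valid, test, num_unseen=1000, seed=0):
    """Hold out entities so they never occur in training/validation triples.
    Each triple is an (h, r, t) tuple; 500 head and 500 tail test entities are sampled."""
    rng = random.Random(seed)
    heads = {h for h, _, _ in test}
    tails = {t for _, _, t in test}
    unseen = set(rng.sample(sorted(heads), num_unseen // 2)) | \
             set(rng.sample(sorted(tails), num_unseen // 2))
    train = [x for x in train if x[0] not in unseen and x[2] not in unseen]
    valid = [x for x in valid if x[0] not in unseen and x[2] not in unseen]
    # Keep test triples with exactly one unseen entity (head or tail, not both).
    test = [x for x in test if (x[0] in unseen) != (x[2] in unseen)]
    return train, valid, test, unseen
```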
6 CONCLUSION
KGs are at the core of numerous AI applications. Research has focused either on KG completion methods working only on the relational structure or on scene understanding in a single image. We presented a novel visual-relational KG where the entities are enriched with visual data. We proposed several novel query types and introduced neural architectures suitable for probabilistic query answering. We also proposed a novel approach to zero-shot learning, namely the problem of visually mapping an image of an entirely new entity to a KG.
We have observed that for some relation types, the proposed models tend to learn a fine-grained visual type that typically occurs as the head or tail of the relation type. In these cases, conditioning on either the head or tail entity does not influence the predictions of the models substantially. This is a potential shortcoming of the proposed methods and we believe that there is a lot of room for improvement for probabilistic query answering in visual-relational KGs.
REFERENCES
[1] Sungjin Ahn, Heeyoul Choi, Tanel Parnamaa, and Yoshua Bengio. 2016. A Neural Knowledge Language Model. arXiv preprint arXiv:1608.00318 (2016).
[2] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. 2000. Gene Ontology: tool for the unification of biology. Nat Genet 25, 1 (2000), 25–29.
[3] J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. 2015. Predicting deep zero-shot convolutional neural networks using textual descriptions. In CVPR.
[4] Sean Bell and Kavita Bala. 2015. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34, 4 (2015), 98.
[5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In SIGMOD. 1247–1250.
[6] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075 (2015).
[7] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
[8] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. 2013. Neil: Extracting visual knowledge from web data. In Proceedings of the IEEE International Conference on Computer Vision. 1409–1416.
[9] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In CVPR.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
[11] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. 2013. Mid-level Visual Element Discovery as Discriminative Mode Seeking. In Advances in Neural Information Processing Systems 26. 494–502.
[12] Ali Farhadi, Ian Endres, Derek Hoiem, and David A. Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 1778–1785.
[13] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 9 (2010), 1627–1645.
[14] Matt Gardner and Tom M. Mitchell. 2015. Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction. In EMNLP. 1488–1498.
[15] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[16] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
[17] Abhinav Gupta, Aniruddha Kembhavi, and Larry S. Davis. 2009. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009), 1775–1789.
[18] Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094 (2015).
[19] Hamid Izadinia, Fereshteh Sadeghi, and Ali Farhadi. 2014. Incorporating scene context and object layout into appearance modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 232–239.
[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
[21] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. 2015. Image Retrieval using Scene Graphs. In CVPR.
[22] Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. 2017. Knowledge Base Completion: Baselines Strike Back. arXiv preprint arXiv:1705.10744 (2017).
[23] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332.
[24] Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 529–539.
[25] Yining Li, Chen Huang, Xiaoou Tang, and Chen Change Loy. 2017. Learning to Disambiguate by Asking Discriminative Questions. In ICCV.
[26] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations. In CVPR.
[27] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. 2016. Visual relationship detection with language priors. In ECCV.
[28] Tomasz Malisiewicz and Alexei A. Efros. 2009. Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships. In Advances in Neural Information Processing Systems.
[29] Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. 2017. The More You Know: Using Knowledge Graphs for Image Classification. In CVPR.
[30] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proc. IEEE 104, 1 (2016), 11–33.
[31] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016. Holographic Embeddings of Knowledge Graphs. In Proceedings of the Thirtieth Conference on Artificial Intelligence. 1955–1961.
[32] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 809–816.
[33] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4004–4012.
[34] Megha Pandey and Svetlana Lazebnik. 2011. Scene recognition and weakly supervised object localization with deformable part-based models. In Computer Vision (ICCV), 2011 IEEE International Conference on. 1307–1314.
[35] B. Romera-Paredes and P. Torr. 2015. An embarrassingly simple approach to zero-shot learning. In ICML.
[36] M. Rotmensch, Y. Halpern, A. Tlimat, S. Horng, and D. Sontag. 2017. Learning a Health Knowledge Graph from Electronic Medical Records. Nature Scientific Reports 5994, 7 (2017).
[37] Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C. Berg, and Li Fei-Fei. 2013. Detecting avocados to zucchinis: what have we done, and where are we going? In International Conference on Computer Vision (ICCV).
[38] Fereshteh Sadeghi and Marshall F. Tappen. 2012. Latent Pyramidal Regions for Recognizing Scenes. In Proceedings of the 12th European Conference on Computer Vision - Volume Part V. 228–241.
[39] F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 815–823.
[40] Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus. arXiv preprint arXiv:1603.06807 (2016).
[41] Edgar Simo-Serra and Hiroshi Ishikawa. 2016. Fashion Style in 128 Floats: Joint Ranking and Classification Using Weak Data for Feature Extraction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR (2014).
[43] Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems. 1857–1865.
[44] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. 2017. Deep Metric Learning via Facility Location. In Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Damien Teney, Lingqiao Liu, and Anton van den Hengel. 2017. Graph-Structured Representations for Visual Question Answering. In CVPR.
[46] Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality. 57–66.
[47] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. arXiv preprint arXiv:1606.06357 (2016).
[48] Kristen Vaccaro, Sunaya Shivakumar, Ziqiao Ding, Karrie Karahalios, and Ranjitha Kumar. 2016. The Elements of Fashion Style. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM.
[49] Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, and Serge Belongie. 2015. Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences. In ICCV.
[50] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. 2017. Deep Metric Learning with Angular Loss. In International Conference on Computer Vision (ICCV).
[51] David S. Wishart, Craig Knox, Anchi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. 2008. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Research 36 (2008), 901–906.
[52] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition. 3485–3492.
[53] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Learning multi-relational semantics using neural-embedding models. arXiv preprint arXiv:1411.4072 (2014).
[54] Bangpeng Yao and Li Fei-Fei. 2010. Modeling mutual context of object and human pose in human-object interaction activities. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. 17–24.
[55] Z. Zhang and V. Saligrama. 2015. Zero-shot learning via semantic similarity embedding. In ICCV.
[56] Yuke Zhu, Alireza Fathi, and Li Fei-Fei. 2014. Reasoning about object affordances in a knowledge base representation. In European Conference on Computer Vision. 408–424.