simNet: Stepwise Image-Topic Merging Network for ... · Topic Extractor: Multiple Instance Learning...
simNet: Stepwise Image-Topic Merging Network for
Generating Detailed and Comprehensive Image Captions
Fenglin Liu¹, Xuancheng Ren²*, Yuanxin Liu¹, Houfeng Wang² and Xu Sun²
¹ Beijing University of Posts and Telecommunications, Beijing, China
² Peking University, Beijing, China
* Equal Contributions
CONTENTS
1 Introduction
2 Approach
3 Experiment
4 Analysis
Introduction
· Soft-Attention: Show, attend and tell: Neural image caption generation with visual attention. In PMLR 2015.
· ATT-FCN: Image captioning with semantic attention. In CVPR 2016.
Figure 1: Examples of using different attention mechanisms.
Introduction: Soft-Attention
Soft-Attention: Image → (encode) → Image Features → (decode) → Caption
· Soft-Attention: Show, attend and tell: Neural image caption generation with visual attention. In PMLR 2015.
Problem: the caption omits “dog” and “mouse”.
Introduction: ATT-FCN
ATT-FCN: Image → (encode) → Image Keywords → (decode) → Caption
· ATT-FCN: Image captioning with semantic attention. In CVPR 2016.
Problem: the caption misses “open” and mislocates “dog”.
Introduction: simNet
simNet: Image → (encode) → Image Features + Image Keywords → (decode) → Caption
Introduction: Main idea
The visual information captured by CNN
The topics extracted by a topic extractor
The merging gate then adaptively adjusts the weight between visual attention and topic attention
Figure 2: Illustration of the main idea.
Contributions
We propose a novel approach that can effectively merge the information in the image and the topics.
The generated captions are both detailed and comprehensive.
The proposed approach outperforms previous works in terms of SPICE, which correlates the best with human judgments.
Approach
Overview
Approach: Image Encoder
Image Encoder: ResNet152
He et al., 2016: Deep residual learning for image recognition. In CVPR 2016.
Approach: Image Encoder
Feature map:
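The feature-map formula did not survive in this transcript. As a rough illustration only, the sketch below shows how a convolutional feature map can be flattened into a set of region vectors that an attention module can weigh. The channels × height × width layout, the helper name `flatten_feature_map`, and the toy 3 × 2 × 2 shape are assumptions for the example, not the sizes used in the slides.

```python
def flatten_feature_map(fmap):
    """Turn a C x H x W nested list into H*W region vectors of length C."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    regions = []
    for i in range(height):
        for j in range(width):
            # Collect the value of every channel at spatial position (i, j).
            regions.append([fmap[c][i][j] for c in range(channels)])
    return regions


# Hypothetical toy feature map: 3 channels over a 2 x 2 spatial grid.
toy_fmap = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
regions = flatten_feature_map(toy_fmap)
```

Each entry of `regions` then describes one image location, which is the unit that visual attention assigns a weight to.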
Approach: Topic Extractor
Topic Extractor: Multiple Instance Learning
Zhang et al., 2006: Multiple instance boosting for object detection. In NIPS 2006.
Fang et al., 2015: From captions to visual concepts and back. In CVPR 2015.
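A minimal sketch of the noisy-OR form of multiple instance learning used for word detection in Fang et al., 2015: a word is considered present in the image if at least one region supports it. The per-region probabilities, the word list, and the helper names `noisy_or` and `extract_topics` are hypothetical; in practice the probabilities would come from a trained detector.

```python
def noisy_or(region_probs):
    """Probability that a word appears in at least one image region."""
    prob_absent_everywhere = 1.0
    for p in region_probs:
        prob_absent_everywhere *= (1.0 - p)
    return 1.0 - prob_absent_everywhere


def extract_topics(word_region_probs, top_k):
    """Rank candidate words by image-level probability and keep the top_k."""
    scored = {word: noisy_or(probs) for word, probs in word_region_probs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


# Hypothetical per-region probabilities for three candidate words.
toy_probs = {
    "dog": [0.5, 0.5],
    "table": [0.1, 0.2],
    "mouse": [0.9, 0.0],
}
topics = extract_topics(toy_probs, top_k=2)
```

Here `top_k` is set to 2 only to keep the toy example small.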
Approach: Input Attention
Input Attention:
Xu et al., 2015: Show, attend and tell: Neural image caption generation with visual attention. In PMLR 2015.
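The attention equations are missing from this transcript. The sketch below shows the general soft-attention pattern only: score each region against a query, normalize the scores with a softmax, and take the weighted sum of the regions. The dot-product score is a stand-in assumption for the learned scoring network, and `soft_attention`, `regions`, and `query` are hypothetical names (the query would typically be a decoder hidden state).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(regions, query):
    """Return attention weights over regions and the attended context vector."""
    # Assumption: a plain dot product replaces the learned scoring network.
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, context


# Two hypothetical region vectors and a query that favors the first one.
weights, context = soft_attention([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

Because the weights sum to one, the context vector is always a convex combination of the region vectors.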
Approach: Output Attention
Output Attention:
You et al., 2016: Image captioning with semantic attention. In CVPR 2016.
Lu et al., 2017: Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR 2017.
Approach: Visual Information
the visual information:
Output Attention:
Approach: Previous Topic Attention
Topic Attention (previous work):
Lacks the attentive visual information when selecting topics!
You et al., 2016: Image captioning with semantic attention. In CVPR 2016.
Approach: Our Topic Attention
Topic Attention (ours):
Approach: Contextual Information
Topic Attention (ours):
the contextual information:
Approach: Merging Gate
How to make full use of the visual information and the contextual information?
Approach: Merging Gate
Visual information is better for words such as “behind” and “red”.
Contextual information is better for words such as “people” and “table”.
Approach: Merging Gate
(where σ is the sigmoid function)
Approach: Merging Gate
The scoring function S is designed to evaluate the importance of the topic attention.
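The gate formula itself is not preserved in this transcript, so the following is only a sketch of one plausible reading of these slides: score both information sources with the same function S, pass the score difference through the sigmoid σ, and use the result to blend the two vectors. The exact form `gate = sigmoid(score_topic - score_visual)`, the convex combination, and all names below are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def merge(visual, topic, score_visual, score_topic):
    """Blend visual and topic vectors with a scalar gate in (0, 1).

    Assumption: the gate grows with the topic score and shrinks with the
    visual score, so a gate near 1 means the topic information dominates.
    """
    gate = sigmoid(score_topic - score_visual)
    merged = [gate * t + (1.0 - gate) * v for v, t in zip(visual, topic)]
    return gate, merged


# Hypothetical vectors and scores: equal scores give an even blend.
gate, merged = merge([0.0, 2.0], [2.0, 0.0], score_visual=1.0, score_topic=1.0)
```

With equal scores the gate is 0.5 and the merged vector is the plain average; a much larger topic score pushes the gate toward 1 so the topic vector dominates.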
Approach: Merging Gate
Approach: Merging Gate
Share Weights
Generating Words
the contextual information:
Experiments
Dataset: Microsoft COCO (MSCOCO) and Flickr30k
· Sparrow bird on branch, with beak inspecting leaves on branch.
· A bird sitting on the branch of a tree near leaves.
· A bird that is sitting in a tree.
· a bird sitting on a branch of a tree.
· a bird that is on a small branch of a tree.
Evaluation Metrics:
· SPICE (correlates the best with human judgments!)
· CIDEr
· BLEU
· METEOR
· ROUGE
Experiments: Results (MSCOCO)
Comparable Models
Experiments: Results (MSCOCO)
Competitive
Analysis
Analysis: The Contributions of The Sub-modules
Comprehensiveness Detailedness
Analysis: Output Attention
The output attention is much more effective than the input attention
Analysis: Visual Attention
A combination of the input attention and the output attention makes the results even better
Analysis: Topic Attention
The topic attention is better at identifying objects but worse at identifying attributes.
Analysis: Visual Attention + Topic Attention
Combining the visual attention and the topic attention directly results in a large boost in performance.
Analysis: Full Model
Applying the merging gate is essential to the overall performance.
Analysis: Visualization
The upper part shows the attention weights of each of the 5 extracted topics. Deeper color means a larger value.
The middle part shows the value of the merging gate, which determines the importance of the topic attention.
The lower part shows the visualization of the visual attention. The blue shade indicates the output attention, and the red shade indicates the input attention.
Analysis: Visualization
Visual information “chair” is more important than contextual information “bed”
Analysis: Examples
erroneous topic “kitchen”
lacking “mouse”
missing “wooden”
error count
Conclusion
The stepwise image-topic merging network can adaptively combine the visual and the semantic attention to achieve substantial improvements.
The generated captions are both detailed and comprehensive.
Our approach outperforms previous works in terms of SPICE on the MSCOCO and Flickr30k datasets.
Thank you! If you have any questions about our paper, you can send an email to [email protected]
Experiments: Results (Flickr30k)
Analysis: Topic Extraction
The reason for using objects as topics is that they are easier to identify, so the generation suffers less from erroneous predictions.
This shows that, for semantic attention, it is also important to limit the absolute number of incorrect visual words rather than merely optimizing the precision or the recall.
Analysis: Merging Gate
Analysis: Error Analysis
There are mainly three types of errors, i.e., distance (32, 26%), movement (22, 18%), and object (60, 49%), with 9 (7%) other errors.
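As a quick arithmetic check, the reported percentages can be recomputed from the raw error counts on this slide. The variable names are illustrative; the counts are the ones stated above.

```python
# Error counts as reported on the slide.
error_counts = {"distance": 32, "movement": 22, "object": 60, "other": 9}

total = sum(error_counts.values())

# Share of each error type, rounded to the nearest whole percent.
shares = {
    name: round(100 * count / total)
    for name, count in error_counts.items()
}
```

The counts sum to 123 errors, and the rounded shares (26, 18, 49, and 7 percent) match the slide and add up to 100.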
Experiments: Results (MSCOCO)
Analysis: Topic Attention