Multimodal Learning for Image Captioning and Visual Question Answering, Xiaodong He, Deep Learning Technology Center, Microsoft Research, UC Berkeley, April 7th, 2016



Page 4:

Knowledge

Vision

Text

Barack Obama is an American politician serving as the 44th President of the United States. Born in Honolulu, Hawaii, … in 2008, he defeated Republican nominee and was inaugurated as president on January 20, 2009. (Wikipedia.org)

http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html

Freebase

Page 5:

a man holding a tennis racquet on a tennis court

the man is on the tennis court playing a game

Image Captioning (one step from perception to cognition)

Describe the objects, attributes, and relationships in an image in natural-language form

Page 6:

Two entries tied for 1st place at the COCO 2015 Captioning Challenge

Adopted the encoder-decoder framework from machine translation. Popular: Google, Montreal, Stanford, Berkeley

Visual concept detection => caption candidate generation => deep semantic ranking

The compositional framework can potentially exploit non-paired image-caption data more effectively

[Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, Mitchell, Platt, Zitnick, Zweig, "From Captions to Visual Concepts and Back," CVPR, June 2015]

[Vinyals, Toshev, Bengio, Erhan, "Show and Tell: A Neural Image Caption Generator," CVPR, June 2015]

Page 7:

sitting

Page 10:

cabinets, wooden, kitchen, sink, floor, room, stove

Repeat to generate 500 candidates

[Fang, et al., CVPR 2015]

Page 11:

Huang, He, Gao, Deng, Acero, Heck, "Learning Deep Structured Semantic Model for Web Search," CIKM, 2013

Page 13:

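[Figure: network that maps a word sequence such as "a man … bench" to a semantic vector]

Word sequence: x_t (<s> w_1 w_2 … w_T <s>)
Word hashing layer: f_t (15K dimensions per word), computed with the word hashing matrix W_f
Convolutional layer: h_t (500 dimensions per position), computed with the convolution matrix W_c
Max pooling layer: v (500 dimensions), computed with the max pooling operation
Semantic layer: y (500 dimensions), computed with the semantic projection matrix W_s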
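The word hashing layer f_t can be illustrated with letter-trigram hashing, the scheme described in the DSSM paper cited on the previous slide (Huang et al., 2013). The slide itself does not spell out the hashing scheme, so treat this as an assumption; the function names and the tiny vocabulary below are hypothetical and only meant as a sketch.

```python
from collections import Counter

def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary marks."""
    marked = "#" + word.lower() + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def hash_word(word, trigram_index):
    """Return a sparse count vector {trigram_id: count} for one word."""
    counts = Counter(letter_trigrams(word))
    return {trigram_index[g]: c for g, c in counts.items() if g in trigram_index}

# Hypothetical vocabulary built from three words of the example caption.
words = ["a", "man", "bench"]
vocab = sorted({g for w in words for g in letter_trigrams(w)})
trigram_index = {g: i for i, g in enumerate(vocab)}

print(letter_trigrams("bench"))          # trigrams of "#bench#"
print(hash_word("man", trigram_index))   # sparse f_t for the word "man"
```

In a real system the trigram vocabulary would be collected from the training corpus, which is where a dimension such as 15K would come from.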

Page 14:

What does the model learn at the convolutional layer?

• Capture the local context-dependent word sense
• Learn one embedding vector for each local context-dependent word

Cosine similarity to "car body shop":

car body kits            0.698
auto body repair         0.578
auto body parts          0.555
wave body language       0.301
calculate body fat       0.220
forcefield body armour   0.165

The similarity between different "body" within contexts

[Figure: the phrases "car body shop", "car body kits", "auto body repair", "auto body part", "wave body language", "calculate body fat", and "forcefield body armour" placed in the semantic space, on a scale from high similarity to low similarity]

auto body repair …

$h_t = W_c \times [f_{t-1}, f_t, f_{t+1}]$
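The convolution formula above can be sketched in plain Python as a sliding window of three word vectors. This is a minimal illustration, not the authors' implementation: the toy vectors, the toy matrix W_c, and the zero padding at the sequence boundaries are assumptions, and published versions of the model may apply a nonlinearity on top of the product.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def conv_layer(f, W_c):
    """Compute h_t = W_c x [f_{t-1}, f_t, f_{t+1}] for every position t."""
    pad = [0.0] * len(f[0])          # assumption: zero vectors stand in for <s>
    padded = [pad] + f + [pad]
    return [
        matvec(W_c, padded[t - 1] + padded[t] + padded[t + 1])
        for t in range(1, len(f) + 1)
    ]

# Toy input: T = 3 word vectors of dimension 2 (made-up values).
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy W_c with 2 output units over the 6-dimensional window.
W_c = [[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]

h = conv_layer(f, W_c)
print(h)  # one vector h_t per word position
```

Because each h_t sees only its neighbours, the same word ("body") gets a different vector in "car body shop" and in "calculate body fat", which is the context dependence the slide describes.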

Page 15:

Global intent

$v(i) = \max_{t=1,\dots,T} h_t(i), \quad i = 1, \dots, 300$

[Figure: max pooling over $h_1, h_2, \dots, h_T$ produces $v$]

Example: auto body repair cost calculator software

Words that win the most active neurons at the max-pooling layer are usually salient words containing clear intents/topics.
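The max-pooling step on this slide, and the idea of a word "winning" a neuron, can be sketched as follows. The activation values are made up for illustration, and `winning_positions` is a hypothetical helper name; each h_t is taken to be the convolutional output centered on word t.

```python
from collections import Counter

def max_pool(h):
    """v(i) = max over t of h_t(i), for every dimension i."""
    return [max(column) for column in zip(*h)]

def winning_positions(h):
    """For each dimension i, the position t whose h_t(i) is the maximum."""
    return [max(range(len(h)), key=lambda t: h[t][i]) for i in range(len(h[0]))]

words = ["auto", "body", "repair", "cost"]
# Made-up convolutional outputs, one 3-dimensional vector per word.
h = [[0.1, 0.9, 0.2],
     [0.3, 0.4, 0.1],
     [0.8, 0.2, 0.7],
     [0.2, 0.1, 0.3]]

v = max_pool(h)
wins = Counter(words[t] for t in winning_positions(h))
print(v)                    # the pooled vector
print(wins.most_common())   # words ordered by number of neurons won
```

Counting winners per word is one simple way to surface the salient words; with real activations the count would run over all 300 dimensions.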

Page 16:

[Figure: Mean Reciprocal Rank % (ranking among 5000 candidates on the 5K validation set) for CDSSM d=300, CDSSM d=1000, and DSSM d=300; y-axis from 0.25 to 0.33, x-axis ticks from 1 to 193]

[Figure: Harmonic Mean Rank (ranking among 5000 candidates on the 5K val set) for CDSSM d=300, CDSSM d=1000, and DSSM d=300; y-axis from 3 to 4, x-axis ticks from 1 to 197]
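The two metrics plotted on this slide can be computed from the rank of the correct candidate for each test item. The sketch below uses the standard definitions; the function names and the sample ranks are hypothetical, and the slide does not show how ties were handled, so counting only strictly higher scores is an assumption.

```python
def rank_of_correct(scores, correct_index):
    """1-based rank of the correct candidate when sorted by descending score."""
    target = scores[correct_index]
    return 1 + sum(1 for s in scores if s > target)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test items."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def harmonic_mean_rank(ranks):
    """Number of items divided by the sum of reciprocal ranks."""
    return len(ranks) / sum(1.0 / r for r in ranks)

# Made-up ranks of the correct caption for three images.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)
hmr = harmonic_mean_rank(ranks)
print(mrr, hmr)
```

By these definitions the harmonic mean rank is the reciprocal of the mean reciprocal rank, which matches the two axis ranges on the slide (0.25 to 0.33 versus 3 to 4).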

Page 17:

Turing Test Results at the MS COCO Captioning Challenge 2015

System              % of captions that pass the Turing Test   Official Rank
MSR                 32.2%                                      1st
Google              31.7%                                      1st
MSR Captivator      30.1%                                      3rd
Montreal/Toronto    27.2%                                      3rd
Berkeley LRCN       26.8%                                      5th
Human               67.5%                                      --

Other groups: Baidu/UCLA, Stanford, Tsinghua, etc.

Still a big gap!

Page 19:

System                  BLEU    % Better or Equal to Human
Model 1: MELM + DMSM    25.7    34.0%
Model 2: MRNN           25.7    29.0%

Human judges were shown a generated caption and a human caption, and chose which was "better" or rated them equal.

Devlin, Cheng, Fang, Gupta, Deng, He, Zweig, and Mitchell, "Language Models for Image Captioning: The Quirks and What Works," ACL 2015

Page 21:

Example: MELM+DMSM: “A plate with a sandwich and a cup of coffee”

MRNN: “A close up of a plate of food” (more generic)

Page 23:

[Figure: system diagram with the components ConvNets, Features vector, Visual concepts, Celebrity, Landmark, Language Model, DMSM, and Confidence Model]

High confidence: A small boat in Ha Long Bay

Low confidence: This image contains: water, boat, lake, mountain, etc.

[Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, Chris Sienkiewicz, submitted to CVPR Deep Vision 2016]

Page 24:

[He, Zhang, Ren, Sun, 2015]

Page 25:

The semantic space of the deep multimodal semantic model (DMSM):

The overall semantics of a caption will also be represented by a vector in this space.

If the image vector and the caption vector are close to each other, then the caption is a good match for the image. Otherwise, it is not a matching caption.
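The matching rule above can be sketched by scoring each candidate caption against the image vector and sorting. Cosine similarity is used here because it is the closeness measure shown on the earlier similarity slide; using it for image-caption matching, along with the vectors and the `rank_captions` helper, is an assumption for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_captions(image_vec, caption_vecs):
    """Return captions sorted from best to worst match for the image."""
    return sorted(
        caption_vecs,
        key=lambda caption: cosine(image_vec, caption_vecs[caption]),
        reverse=True,
    )

# Made-up semantic vectors for one image and two candidate captions.
image_vec = [1.0, 0.0, 1.0]
caption_vecs = {
    "a man holding a tennis racquet on a tennis court": [0.9, 0.1, 0.8],
    "a close up of a plate of food": [0.0, 1.0, 0.1],
}

ranking = rank_captions(image_vec, caption_vecs)
print(ranking[0])  # the caption whose vector is closest to the image vector
```

In the full pipeline this ranking would be applied to the generated caption candidates, with the vectors produced by the image and text branches of the model.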

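[Figure: two-branch network. Image branch: raw image pixels, convolution/pooling, and fully connected layers produce the image feature (input s), which passes through H1, H2, H3 with weights W1 to W4. Text branch: the text "a man holding a tennis racquet on a tennis court" (input t1) passes through H1, H2, H3 with weights W1 to W4.]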

[Fang, et al., CVPR 2015]

[Huang, He, Gao, Deng et al., 2013]

[He, Zhang, Ren, Sun, 2015]

Page 26:

[Guo, Zhang, Hu, He, Gao, 2016]

Page 27:

[Figure: two-branch network. The image (input s) and the caption "a man holding a tennis racquet on a tennis court" (input t1) each pass through H1, H2, H3 with weights W1 to W4.]

Page 28:

System              Excellent   Good    Bad     Embarrassing
Fang et al., 2015   40.6%       26.8%   28.8%   3.8%
New system          51.8%       23.4%   22.5%   2.4%

Human evaluation on 1000 random samples of the COCO test set.

Page 29:

System              Excellent   Good    Bad     Embarrassing
Fang et al., 2015   12.0%       13.4%   63.0%   11.6%
New system          25.4%       24.1%   45.3%   5.2%

Human evaluation on the Instagram test set, which contains 1380 random images that we scraped from Instagram.

Page 30:

Conf. score   Excellent   Good   Bad    Embarrassing
Mean          0.59        0.51   0.26   0.20
Std dev       0.21        0.23   0.21   0.19

Page 31:

Above: Fang2015. Below: Ours.

a man wearing a suit and tie
Ian Somerhalder wearing a suit and tie

a man taking a picture in front of a mirror
an picture about person

a woman standing in front of a christmas tree
a woman standing next to a window

a black and white photo of a man wearing a hat
a man posing for a picture

a man on a skateboard
this picture is about photo

a man holding a stop sign
a man holding a stop sign

a colorful kite flying in the air
a table topped with a kite

a couple of people at night
a fire hydrant that is lit up at night

a black and white photo of a man wearing a hat
a man wearing a bow tie looking at the camera

a view of a sunset over water
a view of a sunset in the ocean

a dog sitting on top of a grass covered field
a dog sitting in the grass

a man holding a baseball bat at a ball
a man swinging a baseball bat in front of a crowd

Page 32:

a woman sitting on a couch
this picture is about person

a woman holding a red umbrella
the image is about person

two women standing in front of a cake
a woman posing for a picture

a man holding a baseball bat on a field
a boy standing in front of a building

a person holding a cell phone
a hand holding a cell phone

a man holding a teddy bear
a picture about table

a pair of scissors sitting on top of a table
a bunch of different items

a woman sitting on a bench
a woman sitting on a bench

a black and white photo of a woman brushing her hair
a woman standing in front of a mirror

a man and a woman wearing a tie
a couple posing for a photo

a pair of scissors
the image is about clothing

a group of pictures on the wall
this picture seems contain text

Page 35:

When Jen-Hsun Huang was giving a keynote showing off a GPU-powered VR visit to Mt. Everest, here is what our CaptionBot had to say.

Page 38:

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, "Stacked Attention Networks for Image Question Answering," CVPR 2016 (oral)

Page 44:

Big improvement

Page 45:

umbrella

Page 48:

a herd of elephants standing next to a man

a herd of elephants standing next to Obama

Obama is chased by his Republican competitors

Image credit: http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html

Obama: the president, from the Democratic Party, whose competitor is the Republican Party

Republican Party: mascot is the elephant

Who is that person?

What is behind that man?

Why are these elephants chasing him?

Page 49:

Character-Level Question Answering with Attention

Reasoning in Vector Space: An Exploratory Study of Question Answering

Deep Reinforcement Learning with an Action Space Defined by Natural Language

Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base

Semantic Parsing for Single-Relation Question Answering

Embedding Entities and Relations for Learning and Inference in Knowledge Bases

Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval

Page 50:

http://CaptionBot.ai