Image-Based Question Answering with Visual-Semantic Embedding

Mengye Ren

Engineering Science, ECE Option

University of Toronto

Supervisor: Prof. Richard S. Zemel

March 31, 2015

Interactive output with natural language

Recent progress in object recognition and detection using deep learning.

Can we interpret images in more interactive ways?

Caption generation: describe an image with a sentence.

Question answering on the image.

Figure: Object recognition example [Krizhevsky et al., 2012]

Figure: Caption generation example [Kiros et al., 2014]


Question answering on images

See the world and interact with people (AI dream).

Retrieve and describe the relevant part of the image.

-What is on the table? -Books.


Problem formulation

The input is an image with a question (a sequence of words).

The output is an answer (a sequence of words).

Find a function that maps the input to the correct output.

We want to use a joint embedding vector space to represent both images and language.


Long short-term memory (LSTM)

Recurrent neural networks (RNNs): “feedback” connection from the output back to the input.

Aim to capture sequential patterns.

Traditional RNNs suffer from exploding and vanishing gradients during training.

LSTM [Hochreiter et al., 1997]: use linear propagation for memory cells, non-linear activation only at input and output.

Training is much faster and more stable.

Recent success in natural language processing (NLP).

Figure: RNN (left) and long short-term memory (right). Diagram labels: input X_t, outputs Y_t and Y_{t−1}, memory cells C_t and C_{t−1}, and gates I_t, O_t, F_t.
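To make the memory-cell idea concrete, here is a minimal sketch of one LSTM step in plain Python. It assumes the commonly used formulation with input, forget, and output gates (matching the labels I_t, F_t, O_t in the figure); the exact variant used in the talk is not stated. The weight layout, the `init_params` helper, and the toy sizes are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def init_params(n_in, n_hidden, seed=0):
    """Small random weights for the three gates and the candidate input
    (hypothetical layout, for illustration only)."""
    rng = random.Random(seed)
    params = {}
    for name in ("i", "f", "o", "g"):
        params["W" + name] = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
            for _ in range(n_hidden)
        ]
        params["b" + name] = [0.0] * n_hidden
    return params


def lstm_step(x_t, y_prev, c_prev, params):
    """One step: returns the new output Y_t and memory cell C_t."""
    z = x_t + y_prev  # list concatenation [X_t ; Y_{t-1}]
    i_t = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # input gate
    f_t = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # forget gate
    o_t = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # output gate
    g_t = [math.tanh(a) for a in affine(params["Wg"], params["bg"], z)]  # candidate input
    # The memory cell is updated linearly: no squashing is applied to c_prev.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Non-linearity is applied only when the cell content is read out.
    y_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return y_t, c_t


def run_lstm(inputs, n_hidden, params):
    """Feed a sequence of input vectors and return the final (Y_T, C_T)."""
    y = [0.0] * n_hidden
    c = [0.0] * n_hidden
    for x_t in inputs:
        y, c = lstm_step(x_t, y, c, params)
    return y, c


params = init_params(n_in=3, n_hidden=4)
y_T, c_T = run_lstm([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], 4, params)
```

Because the cell update is a gated sum rather than a repeated squashing, a cell with its forget gate open and input gate closed carries its content forward unchanged, which is the property that helps with vanishing gradients.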


Visual semantic embedding

Images can be represented in feature space, and words can be represented in vectors (word embedding).

“king” - “man” + “woman” ∼ “queen” [Mikolov et al., 2013]

Goal: project image and words into a common space.

Figure: DeViSE: A Deep Visual-Semantic Embedding Model [Frome et al., 2013]
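The word-vector arithmetic quoted above can be illustrated with a small sketch. The 2-D vectors below are hand-made toy values (one axis loosely "royalty", the other "gender"), not learned embeddings, and the `analogy` helper is a hypothetical name chosen for this example.

```python
import math

# Hand-made toy vectors, NOT learned embeddings:
# axis 0 is loosely "royalty", axis 1 is loosely "gender".
TOY_EMBEDDINGS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "table": [-1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest to a - b + c,
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


result = analogy("king", "man", "woman", TOY_EMBEDDINGS)  # "queen" for these toy vectors
```

With real embeddings the match is only approximate, which is why the nearest neighbour under cosine similarity is used instead of exact equality.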


Previous attempt

A multi-world approach to question answering about real-world scenes based on uncertain input [Malinowski et al., 2014]

Use automatic image segmentation with uncertainty.

Parse the question into logical form.

Search for nearest neighbours in the training set to make inference on different possible segmentations.

Does not scale to large datasets.

Leaves considerable room for improvement in accuracy.


DAQUAR dataset

Around 1500 images, 7000 questions (37 classes).

Three types of questions: object type, object colour, and number of objects.

98.3% of the answers consist of a single word.

Q7: what color is the ornamental plant in front of the fan coil but not close to sofa ?

Answer: red


Image-word model

Idea: treat image as the first word of the question sequence.

Use the last hidden layer of the Oxford VGG conv net [Simonyan et al., 2014] as the feature extractor.

Map image features into the joint embedding space.

We can also use random numbers as images and call it a “blind” model.
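The idea of treating the image as the first word can be sketched as follows. This is only an illustration under stated assumptions: the linear map `W_img`, the toy dimensions, and the function names are hypothetical, and the CNN feature vector is taken as given rather than computed.

```python
import random


def project(W, v):
    """Linear map W v, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def build_input_sequence(image_feat, tokens, word_emb, W_img, blind=False, rng=None):
    """Return the list of vectors fed to the recurrent model.

    The first element is the image feature mapped into the word-embedding
    space; the rest are the embeddings of the question words. For the
    "blind" model the image feature is replaced by random numbers.
    """
    if blind:
        rng = rng or random.Random(0)
        image_feat = [rng.gauss(0.0, 1.0) for _ in image_feat]
    first = project(W_img, image_feat)
    return [first] + [word_emb[t] for t in tokens]


# Toy sizes: 3-D "CNN" feature, 2-D joint embedding space (assumed values).
W_img = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
word_emb = {"how": [0.1, 0.2], "many": [0.3, 0.4], "books": [0.5, 0.6]}
feat = [1.0, 2.0, 3.0]

seq = build_input_sequence(feat, ["how", "many", "books"], word_emb, W_img)
seq_blind = build_input_sequence(feat, ["how", "many", "books"], word_emb, W_img, blind=True)
```

Comparing a model trained on `seq` with one trained on `seq_blind` isolates how much the image features contribute beyond the question text.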

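Figure: image-word model diagram. Components shown: timesteps t = 1, 2, ..., T; CNN image features (image-word model) or a random vector (blind model); word embeddings of the question “how many books”; LSTM; dropout (p); softmax over answers (two: .56, three: .21, five: .09).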


Q193: what is the largest object ?

Correct answer: table

Image-word: table (0.576)

Blind: bed (0.440)


Bidirectional image-word model

At the last timestep, the model needs to immediately come up with an answer.

Sometimes the main body of the question is at the end.

The 1st LSTM reads the sentence from beginning to end, and the 2nd one reads from end to beginning.

Benefit: the 2nd LSTM “knows” the whole sentence at each timestep.
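The two reading directions can be sketched as below. To keep the example short, a simple tanh recurrent cell stands in for the LSTM, and the way the second layer combines the forward outputs with the inputs (an element-wise sum) is an assumption made for illustration; the slides do not specify it. All names and weights are hypothetical.

```python
import math


def rnn_step(x, h, w_x=0.5, w_h=0.5):
    """Simple element-wise tanh recurrent cell (a stand-in for an LSTM)."""
    return [math.tanh(w_x * x_i + w_h * h_i) for x_i, h_i in zip(x, h)]


def run(seq):
    """Run the cell over seq in the given order; return the state at every step."""
    h = [0.0] * len(seq[0])
    outputs = []
    for x in seq:
        h = rnn_step(x, h)
        outputs.append(h)
    return outputs


def bidirectional(seq):
    """Forward pass, then a backward pass over the forward outputs.

    Returns (forward, backward), both aligned with the original word order.
    """
    forward = run(seq)
    # Assumed combination: the second layer sees input + forward output.
    second_in = [[a + b for a, b in zip(x, f)] for x, f in zip(seq, forward)]
    backward = run(second_in[::-1])[::-1]  # read end to beginning, then realign
    return forward, backward


question_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]  # only the last "word" differs
fwd_a, bwd_a = bidirectional(question_a)
fwd_b, bwd_b = bidirectional(question_b)
```

In this sketch the forward state at the first word is identical for both questions, while the backward state at the first word already differs, because it has seen the end of the sentence.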

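Figure: bidirectional image-word model diagram. Components shown: timesteps t = 1, 2, ..., T; CNN image features (image-word model) or a random vector (blind model); word embeddings of the question “how many books”; dropout (p1); forward LSTM; dropout (p2, p3); backward LSTM; softmax over answers (two: .56, three: .21, five: .09).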


Image-word model results

Table: DAQUAR results

Model                       Accuracy   WUPS 0.9   WUPS 0.0
2-IMGWD                     0.3276     0.3298     0.7272
IMGWD                       0.3188     0.3211     0.7279
BLIND                       0.3051     0.3069     0.7229
RANDWD                      0.3036     0.3056     0.7206
BOW                         0.2299     0.2340     0.6907
GUESS                       0.1756     -          -
[Malinowski et al., 2014]   0.1273     0.1810     0.5147
HUMAN                       0.6027     0.6104     0.7896


Image-word model results summary

The results are strong, but the blind model does almost equally well!

Maybe we need a better dataset.

1. 1500 images make for a very small dataset.

2. Many questions do not have an obvious answer.

That is, the image features from the CNN add little beyond the question text on this dataset.


Synthetic question-answer pairs

A large dataset of image descriptions was recently released.

Idea: transform descriptions into questions and answers.

Approach: use Stanford parser [Klein et al., 2003] to parse descriptions into syntactic trees, and operate on the tree structure.

Example: A man is riding a horse => What is the man riding?
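The description-to-question transformation can be illustrated with a toy sketch. Note that this uses a single regular expression as a stand-in for parsing; it is not the Stanford-parser tree manipulation described above and only handles sentences shaped like the example. The pattern and the function name are hypothetical.

```python
import re

# Matches "<article> <subject> is <verb>ing <article> <object>" only.
PATTERN = re.compile(
    r"^(?:a|an|the)\s+(\w+)\s+is\s+(\w+ing)\s+(?:a|an|the)\s+(\w+)\.?$",
    re.IGNORECASE,
)


def description_to_qa(description):
    """Turn a simple description into a (question, answer) pair.

    Returns None when the sentence does not fit the toy pattern.
    """
    match = PATTERN.match(description.strip())
    if match is None:
        return None
    subject, verb, obj = (g.lower() for g in match.groups())
    return f"What is the {subject} {verb}?", obj


pair = description_to_qa("A man is riding a horse")
```

A tree-based approach generalizes far better than a fixed pattern, since it can locate the object phrase regardless of the surface word order.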


COCO-QA dataset

Dataset   # Images    # Questions   # Answers   GUESS baseline
6.6K      3.6K+3K     14K+11.6K     298         0.1156
Full      80K+20K     177K+83K      794         0.0346

Used Microsoft COCO dataset [Lin et al., 2014]. 80K+20K images, 400K+100K descriptions.

Three types of questions: object, colour, number.

Reduced the number of common answers (e.g. “man”, “white”, and “two”). The more often an answer has already appeared, the lower the probability that another pair with that answer is kept.
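As a rough illustration of why common answers matter, the sketch below computes a guess-style baseline. It assumes that GUESS means always predicting the single most frequent training answer; the slides do not define it, so treat this as one plausible reading. The answer lists are made-up toy data.

```python
from collections import Counter


def guess_baseline_accuracy(train_answers, test_answers):
    """Always predict the most frequent training answer (assumed meaning of GUESS).

    Returns the chosen answer and its accuracy on the test answers.
    """
    most_common_answer, _ = Counter(train_answers).most_common(1)[0]
    correct = sum(1 for answer in test_answers if answer == most_common_answer)
    return most_common_answer, correct / len(test_answers)


# Toy data for illustration only.
train = ["two", "two", "white", "man"]
test = ["two", "cat", "two", "white"]
answer, accuracy = guess_baseline_accuracy(train, test)  # ("two", 0.5)
```

The flatter the answer distribution, the lower this baseline, which is consistent with the lower GUESS value reported for the larger dataset.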


COCO-QA results

Q3082: how many separate benches are people sitting on on a sidewalk ?

Answer: two


Learning results on COCO-QA

Table: COCO-QA results

           6.6K                           Full
Model      Acc.     WUPS 0.9   WUPS 0     Acc.     WUPS 0.9   WUPS 0
2-IMGWD    0.3358   0.3454     0.7534     0.3208   0.3304     0.7393
IMGWD      0.3260   0.3357     0.7513     0.3153   0.3245     0.7359
BOW        0.1910   0.2018     0.6968     0.2365   0.2466     0.7058
RANDWD     0.1366   0.1545     0.6829     0.1971   0.2058     0.6862
BLIND      0.1321   0.1396     0.6676     0.2517   0.2611     0.7127
GUESS      0.1156   -          -          0.0346   -          -


Bidirectional model comparison

Q2352: what is being pulled on a runway ?

2-IMGWD: plane (0.3746) airplane (0.3583) jet (0.1609)

IMGWD: bus (0.1393) airplane (0.0888) truck (0.0867)


Summary

LSTM is very well suited to language modelling.

Bidirectional model improves language understanding.

The model still does not really know how to count; “two” is its best guess.

Very limited colour recognition ability, mostly from knowledge of the question: e.g. “flower” is always yellow.

how many jet airplanes are flying in formation in a cloudless sky ?

IMGWD: four (0.4215)


Future work

Visual attention model

Better dataset (human or machine generated)

Better image understanding

Longer answers


Acknowledgement

Ryan Kiros and Rich Zemel

Thanks to Russ Salakhutdinov for helpful discussions

Thanks to Nitish Srivastava for the Toronto Conv Net

Thanks to the UofT CS department machine learning group for computing resources


References

Mateusz Malinowski and Mario Fritz (2014)

A multi-world approach to question answering about real-world scenes based on uncertain input

Neural Information Processing Systems (NIPS’14)

Ryan Kiros and Ruslan Salakhutdinov and Richard S. Zemel (2014)

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

NIPS Deep Learning Workshop (2014)

Sepp Hochreiter and Jürgen Schmidhuber (1997)

Long Short-Term Memory

Neural Computation 9(8), 1735–1780


Andrea Frome and Gregory S. Corrado and Jonathon Shlens and Samy Bengio and Jeffrey Dean and Marc’Aurelio Ranzato and Tomas Mikolov (2013)

DeViSE: A Deep Visual-Semantic Embedding Model

Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, 2121–2129

Tomas Mikolov and Kai Chen and Greg Corrado and Jeffrey Dean (2013)

Efficient Estimation of Word Representations in Vector Space

http://arxiv.org/abs/1301.3781

Karen Simonyan and Andrew Zisserman (2014)

Very Deep Convolutional Networks for Large-Scale Image Recognition

http://arxiv.org/abs/1409.1556

Dan Klein and Christopher D. Manning (2003)

Accurate Unlexicalized Parsing

Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (2003), 423–430


Alex Krizhevsky and Ilya Sutskever and Geoffrey E. Hinton (2012)

ImageNet Classification with Deep Convolutional Neural Networks

Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, 1106–1114

Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and Pietro Perona and Deva Ramanan and Piotr Dollar and C. Lawrence Zitnick (2014)

Microsoft COCO: Common Objects in Context

Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, 740–755

Nathan Silberman and Derek Hoiem and Pushmeet Kohli and Rob Fergus (2012)

Indoor Segmentation and Support Inference from RGBD Images

ECCV (2012)


The End

