Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent...

27
Image-Based Question Answering with Visual-Semantic Embedding Mengye Ren Engineering Science, ECE Option University of Toronto [email protected] Supervisor: Prof. Richard S. Zemel March 31, 2015 Mengye Ren ImageQA March 31, 2015 1 / 27

Transcript of Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent...

Page 1: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Image-Based Question Answering with Visual-SemanticEmbedding

Mengye Ren

Engineering Science, ECE OptionUniversity of Toronto

[email protected]

Supervisor: Prof. Richard S. Zemel

March 31, 2015

Mengye Ren ImageQA March 31, 2015 1 / 27

Page 2: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Interactive output with natural language

Recent progress in object recognition and detection using deeplearning.

Can we interpret images in more interactive ways?

Caption generation: describe an image with a sentence.

Question answering on the image.

Figure: Object recognition example [Krizhevsky et al., 2012]

Mengye Ren ImageQA March 31, 2015 2 / 27

Page 3: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Interactive output with natural language

Recent progress in object recognition using deep learning.

Can we interpret images in more interactive ways?

Caption generation: describe an image with a sentence.

Question answering on the image.

Figure: Caption generation example [Kiros et al., 2014]

Mengye Ren ImageQA March 31, 2015 3 / 27

Page 4: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Question answering on images

See the world and interact with people (AI dream).

Retrieve and describe the relevant part of the image.

-What is on the table ?

Mengye Ren ImageQA March 31, 2015 4 / 27

Page 5: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Question answering on images

See the world and interact with people (AI dream).

Retrieve and describe the relevant part of the image.

-What is on the table ? -Books.

Mengye Ren ImageQA March 31, 2015 5 / 27

Page 6: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Problem formulation

The input is an image with a question (a sequence of words).

The output is an answer (a sequence of words).

Find a function that maps the input to the correct output.

Want to use a joint embedding vector space to represent both imageand language.

Mengye Ren ImageQA March 31, 2015 6 / 27

Page 7: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Long short-term memory (LSTM)

Recurrent neural networks (RNNs): “feedback” connection from theoutput back to input.

Aim to capture sequential patterns.

Gradient explosion and vanishing in training traditional RNNs.

LSTM [Hochreiter et al., 1997]: use linear propagation for memorycells, non-linear activation only at input and output.

Training is much faster and stable.

Recent success in natural language proccessing (NLP).

Xt

Xt

Xt

Xt

Yt

Yt−1

Yt−1

Yt−1

Ct

Ct−1 Ct

It Ot

Ft

Figure: RNN (left) and long short-term memory (right)

Mengye Ren ImageQA March 31, 2015 7 / 27

Page 8: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Visual semantic embedding

Images can be represented in feature space, and words can berepresented in vectors (word embedding).

“king” - “man” + “woman” ∼ “queen” [Mikolov et al., 2013]

Goal: project image and words into a common space.

Figure: DeViSE: A Deep Visual-Semantic Embedding Model [Frome et al., 2013]

Mengye Ren ImageQA March 31, 2015 8 / 27

Page 9: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Previous attempt

A multi-world approach to question answering about real-worldscenes based on uncertain input [Malinowski et al., 2014]

Use automatic image segmentation with uncertainty.

Parse the question into logical form.

Search for nearest neighbours in the training set to make inference ondifferent possible segmentations.

Does not scale to large dataset.

Results can be improved by a lot.

Mengye Ren ImageQA March 31, 2015 9 / 27

Page 10: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

DAQUAR dataset

Around 1500 images, 7000 questions (37 classes).

Three types of questions: Object type, object color, number of object

98.3% of the answers are composed of single word only.

Q7: what color is the ornamental plant in front of the fan coil but not close to sofa ?-red

Mengye Ren ImageQA March 31, 2015 10 / 27

Page 11: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Image-word model

Idea: treat image as the first word of the question sequence.

Use the last hidden layer of the Oxford VGG conv net[Simonyan et al., 2014] as the feature extractor.

Map image features into the joint embedding space.

We can also use random numbers as images and call it a “blind”model.

t = 1 t = 2 t = T

how many books

dropout p

LSTM

CNN

softmax

two three five.56 .21 .09

random

image

blind

image-word

word embedding

Mengye Ren ImageQA March 31, 2015 11 / 27

Page 12: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Q193: what is the largest object ?Correct answer: table

Image-word: table (0.576)Blind: bed (0.440)

Mengye Ren ImageQA March 31, 2015 12 / 27

Page 13: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Bidirectional image-word model

At the last timestep, the model needs to immidiately come up with ananswer.Sometimes the main body of the question is at the end.The 1st LSTM reads the sentence from beginning to end, and the2nd one reads from end to beginning.Benefit: the 2nd LSTM “knows” the whole sentence at each timestep.

t = 1 t = 2 t = T

how many books

dropout p1

forward LSTM

dropout p2, p3

backward LSTM

CNN

softmax

two three five.56 .21 .09

random

image

blind

image-word

word embedding

Mengye Ren ImageQA March 31, 2015 13 / 27

Page 14: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Image-word model results

Table: DAQUAR results

Accuracy WUPS 0.9 WUPS 0.02-IMGWD 0.3276 0.3298 0.7272IMGWD 0.3188 0.3211 0.7279BLIND 0.3051 0.3069 0.7229RANDWD 0.3036 0.3056 0.7206BOW 0.2299 0.2340 0.6907GUESS 0.1756[Malinowski et al., 2014] 0.1273 0.1810 0.5147HUMAN 0.6027 0.6104 0.7896

Mengye Ren ImageQA March 31, 2015 14 / 27

Page 15: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Image-word model results summary

The results are strong, but blind model can do almost equally well!

Maybe we need a better dataset.1 1500 images are a very small dataset.2 Many questions are not very obvious to answer.

i.e. The images features from the CNN are not that useful.

Mengye Ren ImageQA March 31, 2015 15 / 27

Page 16: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Synthetic question-answer pairs

Recently released large dataset on image descriptions.

Idea: transform descriptions into questions and answers.

Approach: use Stanford parser [Klein et al., 2003] to parsedescriptions into syntactic trees, and operate on the tree structure.

Example: A man is riding a horse => What is the man riding?

Mengye Ren ImageQA March 31, 2015 16 / 27

Page 17: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

COCO-QA dataset

Dataset # Images # Questions # Answers GUESS baseline6.6K 3.6K+3K 14K+11.6K 298 0.1156Full 80K+20K 177K+83K 794 0.0346

Used Microsoft COCO dataset [Lin et al., 2014]. 80K+20K images,400K+100K descriptions.

Three types of questions: object, colour, number.

Reduced the number of common answers (e.g. “man”, “white”, and“two”). The probability of an answer appearing again is decreasing.

Mengye Ren ImageQA March 31, 2015 17 / 27

Page 18: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

COCO-QA results

Q3082: how many separate benches are people sitting on on a sidewalk ?-Two

Mengye Ren ImageQA March 31, 2015 18 / 27

Page 19: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Learning results on COCO-QA

Table: COCO-QA results

6.6K FullAcc. WUPS 0.9 WUPS 0 Acc. WUPS 0.9 WUPS 0

2-IMGWD 0.3358 0.3454 0.7534 0.3208 0.3304 0.7393IMGWD 0.3260 0.3357 0.7513 0.3153 0.3245 0.7359

BOW 0.1910 0.2018 0.6968 0.2365 0.2466 0.7058RANDWD 0.1366 0.1545 0.6829 0.1971 0.2058 0.6862BLIND 0.1321 0.1396 0.6676 0.2517 0.2611 0.7127GUESS 0.1156 0.0346

Mengye Ren ImageQA March 31, 2015 19 / 27

Page 20: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Bidirectional model comparison

Q2352: what is being pulled on a runway ?2IMGWD: plane (0.3746) airplane (0.3583) jet (0.1609)IMGWD: bus (0.1393) airplane (0.0888) truck (0.0867)

Mengye Ren ImageQA March 31, 2015 20 / 27

Page 21: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Summary

LSTM is a very suitable for language modelling.

Bidirectional model improves language understanding.

Still doesn’t quite know how to count, “two” is the best guess.

Very limited colour recognition ability, mostly from knowledge of thequestion: e.g. “flower” is always yellow.

how many jet airplanes are flying in formation in a cloudless sky ?IMGWD: four (0.4215)

Mengye Ren ImageQA March 31, 2015 21 / 27

Page 22: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Future works

Visual attention model

Better dataset (human or machine generated)

Better image understanding

Longer answers

Mengye Ren ImageQA March 31, 2015 22 / 27

Page 23: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

Acknowledgement

Ryan Kiros and Rich Zemel

Thanks Russ Salakhudinov for helpful discussion

Thanks Nitish Srivastava for Toronto Conv Net

Thanks UofT CS department machine learning group for computingresources

Mengye Ren ImageQA March 31, 2015 23 / 27

Page 24: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

References

Mateusz Malinowski and Mario Fritz (2014)

A multi-world approach to question answering about real-world scenes based onuncertain Input

Neural Information Processing Systems (NIPS’14)

Ryan Kiros and Ruslan Salakhutdinov and Richard S. Zemel, 2014

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

NIPS Deep Learning Workshop (2014)

Sepp Hochreiter and Jurgen Schmidhuber

Long Short-Term Memory

Neural Computation 9(8) 1735–1780

Mengye Ren ImageQA March 31, 2015 24 / 27

Page 25: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

References

Andrea Frome and Gregory S. Corrado and Jonathon Shlens and Samy Bengio andJeffrey Dean and Marc’Aurelio Ranzato and Tomas Mikolov (2013)

DeViSE: A Deep Visual-Semantic Embedding Model

Advances in Neural Information Processing Systems 26: 27th Annual Conferenceon Neural Information Processing Systems 2013, 2121–2129

Tomas Mikolov and Kai Chen and Greg Corrado and Jeffrey Dean (2013)

Efficient Estimation of Word Representations in Vector Space

http://arxiv.org/abs/1301.3781

Karen Simonyan and Andrew Zisserman (2014)

Very Deep Convolutional Networks for Large-Scale Image Recognition

http://arxiv.org/abs/1409.1556

Dan Klein and Christopher D. Manning (2003)

Accurate Unlexicalized Parsing

In proceedings of the 41st Annual Meeting of the Association for ComputationalLinguistics (2003) 423–430

Mengye Ren ImageQA March 31, 2015 25 / 27

Page 26: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

References

Alex Krizhevsky and Ilya Sutskever and Geoffrey E. Hinton (2012)

ImageNet Classification with Deep Convolutional Neural Networks

Advances in Neural Information Processing Systems 25: 26th Annual Conferenceon Neural Information Processing Systems 2012. 1106–1114.

Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and PietroPerona and Deva Ramanan and Piotr Dollar and C. Lawrence Zitnick (2014)

Microsoft COCO: Common Objects in Context

Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland,September 6-12, 2014, Proceedings, Part V 740–755

Nathan Silberman and Derek Hoiem and Pushmeet Kohli and Rob Fergus

Indoor Segmentation and Support Inference from RGBD Images

ECCV (2012)

Mengye Ren ImageQA March 31, 2015 26 / 27

Page 27: Image-Based Question Answering with Visual-Semantic Embeddingmren/slides/imageqa.pdf · Recent success in natural language proccessing (NLP). X t X t X t X t Y t Y 1 Y 1 Y 1 C t C

The End

Mengye Ren ImageQA March 31, 2015 27 / 27