VQA: Visual Question Answering
Anmol Popli and Sparsh Gupta

Agenda
Introduction
Dataset
Joint Embedding Based Model
Vanilla Attention Based Model
Bottom-Up and Top-Down Attention Based Model
Ideas from Multihop Attention Network
Results and Future Work
Visual Question Answering
Given an input image and a natural-language question about the image, the task is to provide a natural-language answer as output.
Some key areas of VQA application are:
● Helping visually impaired users understand their surroundings
● Helping intelligence analysts working with visual data
● Efficient image retrieval for specific search queries
Visual Question Answering
Key Challenges
● Multi-discipline problem: requires both language understanding and vision understanding
● Vision tasks involved: object detection, object recognition, counting, scene classification
Source: Antol et al. VQA: Visual Question Answering. IEEE ICCV 2015
Dataset
● VQA Dataset (v2.0) for the VQA 2019 challenge
● Our work focuses on the Balanced Real Images portion of the dataset
● 444k questions based on 83k images
● 10 annotations per question
Source: www.visualqa.org (download: https://visualqa.org/download.html)
Preprocessing and Training Specifications
● Resized images to 224 x 224
● Chose only the 3000 most frequent answers
● Converted text to lowercase, removed punctuation, and treated named entities as UNK
● Capped question length at 15 words
● Used GloVe embeddings to initialize word embeddings
● Used cross-entropy loss for the 3000-class classification problem
● Used the Adam optimizer with scheduled learning rate decay
● L2 regularization of weights
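The text preprocessing steps above can be sketched as follows; this is a minimal illustration of the lowercasing, punctuation stripping, 15-word cap, and top-3000 answer vocabulary described on the slide (function names and the regex are our own, not from the original code):

```python
import re
from collections import Counter

MAX_Q_LEN = 15       # question length cap from the slides
NUM_ANSWERS = 3000   # number of most-frequent answer classes kept

def preprocess_question(text, max_len=MAX_Q_LEN):
    """Lowercase, strip punctuation, and cap the token count at max_len."""
    text = re.sub(r"[^\w\s]", "", text.lower())  # drop punctuation
    return text.split()[:max_len]

def build_answer_vocab(answers, k=NUM_ANSWERS):
    """Map each of the k most frequent answers to a class index; examples
    whose answer falls outside this set are excluded from training."""
    counts = Counter(answers)
    return {ans: i for i, (ans, _) in enumerate(counts.most_common(k))}
```

Named-entity handling (mapping entities to UNK) would require an NER pass on top of this and is omitted here.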
Joint Embedding Based Model
Source: Antol et al. VQA: Visual Question Answering. IEEE ICCV 2015
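The forward pass of a joint embedding model in the style of Antol et al. can be sketched as below: a CNN image feature and an LSTM question feature are each projected to a shared space, fused by elementwise product, and classified over the 3000 answer classes. All dimensions are illustrative and the random matrices stand in for learned parameters; this is a shape-level sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_Q, D_JOINT, N_ANS = 4096, 2048, 1024, 3000  # illustrative sizes

# Hypothetical learned parameters (random here, only to show the shapes).
W_img = rng.standard_normal((D_JOINT, D_IMG)) * 0.01
W_q   = rng.standard_normal((D_JOINT, D_Q)) * 0.01
W_out = rng.standard_normal((N_ANS, D_JOINT)) * 0.01

def joint_embedding_logits(img_feat, q_feat):
    """Project both modalities to a shared space, fuse by elementwise
    product, and classify over the answer vocabulary."""
    img_feat = img_feat / np.linalg.norm(img_feat)  # L2-normalize CNN feature
    v = np.tanh(W_img @ img_feat)                   # image embedding
    q = np.tanh(W_q @ q_feat)                       # question embedding
    return W_out @ (v * q)                          # fuse, then answer logits

logits = joint_embedding_logits(rng.standard_normal(D_IMG),
                                rng.standard_normal(D_Q))
```

The elementwise product is the key design choice: it forces the image and question embeddings to agree dimension by dimension before classification.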
Joint Embedding Based Model - Examples
Q: Is the TV on or off?  A: on
Q: What side of the plate is the fork on?  A: left
Q: What flavor is the cake?  A: vanilla
Q: What is the man on the right wearing?  A: wetsuit
Vanilla Attention Based Model
Source: Kazemi et al. Show, ask, attend and answer: Strong Baseline for Visual Question Answering. arxiv.org/abs/1704.03162
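The core attention step in this family of models can be sketched as follows: each spatial location of the convolutional feature map is scored against the question, the scores are softmaxed into an attention map, and the feature map is pooled with those weights. The bilinear scoring form and all dimensions are our simplifying assumptions (Kazemi et al. use a small conv network and multiple glimpses); random weights stand in for learned ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(conv_feats, q_vec, W):
    """Score each spatial location against the question, softmax the
    scores into an attention map, and pool the feature map with it."""
    scores = conv_feats @ (W @ q_vec)  # one scalar score per location
    weights = softmax(scores)          # attention map over the grid
    return weights @ conv_feats, weights

rng = np.random.default_rng(0)
N, C, D_Q = 196, 2048, 1024            # 14x14 grid, channels, question dim (illustrative)
W = rng.standard_normal((C, D_Q)) * 0.01
context, attn = attend(rng.standard_normal((N, C)),
                       rng.standard_normal(D_Q), W)
```

The attended vector `context` is then typically concatenated with the question feature and fed to the answer classifier.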
Vanilla Attention Based Model - Examples
Vanilla Attention Based Model - Attention Visualization
Q: How many planes are in the air?  A: 1
Q: What kind of collar is the man’s shirt?  A: white
Bottom-Up and Top-Down Attention Based Model
Source: Anderson et al. Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering. arxiv.org/abs/1707.07998
Bottom-Up and Top-Down Attention Based Model
● Faster-RCNN with a ResNet-101 backbone
● Regions of Interest (ROIs) obtained from Faster-RCNN
● Allows the model to compute attention for selected regions, and not the entire image
● Feature maps of each ROI are computed and are passed to the model
● Attention weight for each ROI feature map is computed
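The last two bullets can be sketched as below: instead of a spatial grid, attention is computed over the K ROI feature vectors produced by Faster-RCNN. The additive scoring form (project both inputs, tanh, dot with a learned vector) follows the general shape used by Anderson et al., but all dimensions and the random stand-in weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, D_V, D_Q, D_H = 36, 2048, 512, 512  # illustrative: 36 ROIs per image

# Hypothetical learned parameters (random here, only to show the shapes).
W_v = rng.standard_normal((D_H, D_V)) * 0.01
W_q = rng.standard_normal((D_H, D_Q)) * 0.01
w_a = rng.standard_normal(D_H) * 0.01

def top_down_attention(roi_feats, q_vec):
    """Score each Faster-RCNN region against the question, then pool
    the region features with the resulting softmax weights."""
    h = np.tanh(roi_feats @ W_v.T + W_q @ q_vec)  # (K, D_H) joint hidden
    alpha = softmax(h @ w_a)                      # one weight per ROI
    return alpha @ roi_feats, alpha               # attended feature, weights
```

Attending over ROIs rather than a fixed grid means the attention weights align with object proposals, which is what makes the visualizations interpretable.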
Multihop Attention Network
● Image features from Faster-RCNN for candidate ROIs
● Question features from LSTM
● Multiple layers of attention for attending to different regions of the image based on different words
● Aggregation: concatenation or addition
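The multihop idea and the two aggregation choices can be sketched as below. How the query is refined between hops varies between multihop architectures; the simple additive query update here is our own assumption, and the shared feature dimension is chosen only so the dot products line up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multihop_attention(roi_feats, q_vec, n_hops=2, aggregate="concat"):
    """Run several attention hops over the ROI features; each hop refines
    the query with the previous context, so different hops can attend to
    different regions.  Hop outputs are concatenated or summed."""
    query, contexts = q_vec, []
    for _ in range(n_hops):
        alpha = softmax(roi_feats @ query)  # attention over ROIs
        ctx = alpha @ roi_feats             # attended feature for this hop
        contexts.append(ctx)
        query = query + ctx                 # refine query for the next hop
    if aggregate == "concat":
        return np.concatenate(contexts)     # (n_hops * D,)
    return np.sum(contexts, axis=0)         # additive aggregation, (D,)
```

Concatenation preserves each hop's contribution separately at the cost of a wider classifier input; addition keeps the dimensionality fixed.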
Results and Future Work
● Results obtained on validation data are on par with those reported in the respective papers
● Experimental model yet to be trained and tested
● Final test data predictions of experimental model to be submitted to challenge page
● Code will be made public shortly
Model                                        Accuracy on Validation Data
Joint Embedding Based Model - VGG-16         49.28%
Joint Embedding Based Model - ResNet-152     51.76%
Vanilla Attention Based Model - VGG-16       55.35%
Vanilla Attention Based Model - ResNet-152   57.13%
Q & A