GuessWhat?! Visual object discovery throughlcarin/Roy2.16.2018.pdf · GuessWhat?! Visual object...

GuessWhat?! Cooperative Visual Dialog Agents

GuessWhat?! Visual object discovery through multi-modal dialogue ¹

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning ²

¹ Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville

² Abhishek Das, Satwik Kottur, Jose M. F. Moura, Stefan Lee, Dhruv Batra

Presented by Ruiyi (Roy) Zhang

February 16th, 2018


Main ideas (GuessWhat?!)

Key contribution: the introduction of the GuessWhat?! dataset, based on the MS COCO dataset
Define sub-tasks: the questioner, guesser and oracle tasks
Establish initial baselines for the introduced tasks

[Two example images (#203974 and #168019), each shown with its yes/no dialogue. One dialogue asks "Is it a person?", "Is it an item being worn or held?", "Is it a snowboard?", "Is it the red one?", "Is it the one being held by the person in blue?"; the other asks "Is it a cow?", "Is it the big cow in the middle?", "Is the cow on the left?", "On the right?", "First cow near us?".]

Figure: Two example games in the dataset.


A GuessWhat?! game

An image I ∈ R^{M×N} containing a set of K segmented objects {O_1, ..., O_K}.
Each object O_k is assigned an object category c_k ∈ {1, ..., C}.
The game further consists of a sequence of questions and answers D = {q_1, a_1, ..., q_J, a_J}, produced by the questioner and oracle. Each answer a_j ∈ {Yes, No, N/A}.
The oracle has access to the identity of the correct object O_correct, and the prediction of the questioner is denoted O_predict.
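As a rough illustration of the game definition above, the objects, the dialogue D, and the success check could be held in a small data structure. This is only a sketch: the class and field names (`SegmentedObject`, `Game`, `correct_index`) are hypothetical and do not come from the GuessWhat?! code base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ANSWERS = ("Yes", "No", "N/A")  # the three answers a_j allowed by the game


@dataclass
class SegmentedObject:
    object_id: int
    category: int  # c_k in {1, ..., C}


@dataclass
class Game:
    objects: List[SegmentedObject]  # {O_1, ..., O_K}
    correct_index: int              # position of O_correct; only the oracle sees it
    dialogue: List[Tuple[str, str]] = field(default_factory=list)  # D as (q_j, a_j) pairs

    def add_turn(self, question: str, answer: str) -> None:
        """Append one question/answer pair, rejecting answers outside ANSWERS."""
        if answer not in ANSWERS:
            raise ValueError(f"answer must be one of {ANSWERS}")
        self.dialogue.append((question, answer))

    def is_success(self, predicted_index: int) -> bool:
        """The game is won when O_predict equals O_correct."""
        return predicted_index == self.correct_index
```

A game is then built from the segmented objects, extended turn by turn with `add_turn`, and scored once with `is_success`.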


The oracle task

Produce a Yes/No/Not applicable answer about any object within an image, given a natural language question.

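[Figure: oracle model. The question ("Is it a vase?") is read word by word by an LSTM; the image context and the object crop each pass through VGG16; together with the spatial information and the object category, these inputs feed an MLP that outputs Yes / No / Not applicable.]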
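To make the oracle inputs concrete, here is a minimal sketch of how the question encoding, object category, and spatial information might be concatenated before the MLP. The slides do not specify the feature layout, so everything here is an assumption: the 8-value spatial vector (box corners, center, width and height, scaled to [-1, 1]), the one-hot category encoding, and the function names.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def one_hot(category: int, num_categories: int) -> List[float]:
    """Encode a 1-based category c_k in {1, ..., C} as a one-hot vector."""
    if not 1 <= category <= num_categories:
        raise ValueError("category must lie in 1..C")
    return [1.0 if k == category else 0.0 for k in range(1, num_categories + 1)]


def spatial_features(box: Box, width: float, height: float) -> List[float]:
    """Assumed layout: corners, center, and size of the box, scaled to [-1, 1]."""
    x_min, y_min, x_max, y_max = box
    x0, x1 = 2.0 * x_min / width - 1.0, 2.0 * x_max / width - 1.0
    y0, y1 = 2.0 * y_min / height - 1.0, 2.0 * y_max / height - 1.0
    return [x0, y0, x1, y1, (x0 + x1) / 2.0, (y0 + y1) / 2.0, x1 - x0, y1 - y0]


def oracle_input(question_encoding: Sequence[float], category: int,
                 num_categories: int, box: Box,
                 width: float, height: float) -> List[float]:
    """Concatenate question, category, and spatial features into one MLP input."""
    return (list(question_encoding)
            + one_hot(category, num_categories)
            + spatial_features(box, width, height))
```

The resulting vector would be passed to a classifier over the three answers; the crop and image features from VGG16 could be appended in the same way.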


The guesser task

Guesser: given an image I and a sequence of questions and answers D_J, predict the correct object O_correct from the set of all objects O.

Questioner: given an image I and a sequence of T questions and answers D_{≤T}, produce a new question q_{T+1}.

The guesser model:

[Figure: guesser diagram. An LSTM / HRED encoder reads the dialogue ("Is it a vase? Yes. Is it partially visible? No. Is it in the left corner? No. Is it the turquoise and purple one? Yes."); each object obj1, ..., obj4 passes through an MLP, and a softmax over the objects yields O_predict.]

Figure: Overview of the guesser model for an image with 4 segmented objects. The weights are shared among the MLPs.
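The scoring step of the guesser can be sketched as follows. The slides only show per-object MLPs followed by a softmax; scoring each object by a dot product between the dialogue encoding and the object embedding is an assumption here, and `guess` is a hypothetical name. The embeddings are taken as given, standing in for the outputs of the shared MLP.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guess(dialogue_encoding: Sequence[float],
          object_embeddings: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Score each object against the dialogue and return (argmax index, probabilities)."""
    scores = [sum(d * o for d, o in zip(dialogue_encoding, emb))
              for emb in object_embeddings]
    probs = softmax(scores)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs
```

Because the same embedding function is applied to every object, the model handles images with any number of segmented objects.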


The questioner task

Trained by maximizing the conditional log-likelihood:

$$
\log P(Q \mid A, I) = \log \prod_{j=1}^{J} P(q_j \mid q_{<j}, a_{<j}, I) = \log \prod_{j=1}^{J} \prod_{i=1}^{N_j} P(w_i^j \mid w_{<i}^j, q_{<j}, a_{<j}, I) \tag{1}
$$
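Since the log of a product is a sum of logs, the objective in (1) reduces to summing the log-probability of every word in every question. A minimal sketch, assuming the per-word probabilities have already been produced by some model (`dialogue_log_likelihood` and its input layout are illustrative, not from the paper's code):

```python
import math
from typing import List


def dialogue_log_likelihood(token_probs: List[List[float]]) -> float:
    """token_probs[j][i] holds the model probability of word i of question j,
    already conditioned on the preceding words, earlier dialogue, and the image.
    Returns log P(Q | A, I) as a double sum of logs."""
    return sum(math.log(p) for question in token_probs for p in question)
```

Maximizing this quantity over the training dialogues is the same as minimizing the summed word-level cross-entropy.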

[Figure: HRED questioner diagram showing question encoders, context states conditioned on the VGG image features and on the answers a1, a2 (Yes, No), and word-level decoders emitting w_1, w_2, ... for questions such as "Is it a vase?", "Is it partially visible?" and "Is it in the left corner?".]

Figure: HRED model conditioned on the VGG features of the image. The example shows the distribution over the third question given the first two questions, their answers, and the image, P(q_3 | q_{<3}, a_{<3}, I).


Oracle baseline results

Model                                    Train err   Val err   Test err
Dominant class (no)                      47.4%       46.2%     50.9%
Question                                 40.2%       41.7%     41.2%
Image                                    45.7%       46.7%     46.7%
Crop                                     40.9%       42.7%     43.0%
Question + Crop                          22.3%       29.1%     29.2%
Question + Image                         37.9%       40.2%     39.8%
Question + Category                      23.1%       25.8%     25.7%
Question + Spatial                       28.0%       31.2%     31.3%
Question + Category + Spatial            17.2%       21.1%     21.5%
Question + Category + Crop               20.4%       24.4%     24.7%
Question + Spatial + Crop                19.4%       26.0%     26.2%
Question + Category + Spatial + Crop     16.1%       21.7%     22.1%
Question + Spatial + Crop + Image        20.7%       27.7%     27.9%
Question + Category + Spatial + Image    19.2%       23.2%     23.5%

Table: Classification errors for the oracle baselines. The best-performing model on the validation and test sets is "Question + Category + Spatial", which refers to the MLP that takes the question, the selected object class, and its spatial features as input.


Guesser and questioner baseline results

Model       Train err   Val err   Test err
Human       9.0%        9.2%      9.2%
Random      82.9%       82.9%     82.9%
LSTM        27.9%       37.9%     38.7%
HRED        32.6%       38.2%     39.0%
LSTM+VGG    26.1%       38.5%     39.5%
HRED+VGG    27.4%       38.4%     39.6%

Table: Classification errors for the guesser baselines.

Model                       Error
Human generated dialogue    38.7%
QGen+GT                     53.2%
QGen+ORACLE                 66.0%
Random                      82.9%

Table: Test error for the questioner (QGen) based on the VGG+HRED guesser model. The reported error is that of the guesser model when fed the questions from the questioner.


Main ideas (Cooperative Visual Dialog Agents)

[Figure: example game. Caption: "Two zebra are walking around their pen at the zoo." Q1: "Any people in the shot?" A1: "No, there aren't any." (feature estimate [0.1, -1, 0.2, …, 0.5]) ... Q10: "Are they facing each other?" A10: "They aren't." (feature estimate [-0.5, 0.1, 0.7, …, 1]). Final guess: "I think we were talking about this image!"]

A cooperative image-guessing game between two agents, Q-BOT and A-BOT, is proposed.

The agents communicate through natural language dialog, and then Q-BOT selects a particular unseen image from a lineup.

These agents are modeled as deep neural networks and trained end-to-end with reinforcement learning.


Model Overview

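[Figure: model overview. Q-BOT consists of a question decoder, fact embedding, history encoder, and feature regression network; A-BOT consists of a question encoder, answer decoder, fact embedding, and history encoder. Over rounds of dialog the agents exchange a question ("Are there any animals?") and an answer ("Yes, there are two elephants."); Q-BOT's predicted image feature (e.g. [0.1, -2, 0, …, 0.57]) is passed to the reward function.]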

Two agents: Q-BOT & A-BOT
Environment: Image
Action:
  Q-BOT: question q_t ("Are there any animals?")
  A-BOT: answer a_t ("Yes, there are two elephants.")
  Q-BOT: image regression y_t ∈ R^{4096}
State:
  Q-BOT: s^Q_t = [c, q_1, a_1, ..., q_{t−1}, a_{t−1}]
  A-BOT: s^A_t = [I, c, q_1, a_1, ..., q_{t−1}, a_{t−1}, q_t]


Model Overview


At each round t of dialog,
  Q-BOT generates a question q_t from its question decoder, conditioned on its state encoding S^Q_{t−1}
  A-BOT encodes q_t, updates its state encoding S^A_t, and generates an answer a_t
  Both encode the completed exchange as F^Q_t and F^A_t
  Q-BOT updates its state to S^Q_t, predicts an image representation y_t, and receives a reward
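The round structure above can be sketched as a simple loop. This is a schematic under stated assumptions, not the authors' implementation: the three callables `ask`, `answer`, and `predict` are hypothetical stand-ins for the question decoder, the answer decoder, and the feature regression network; states are plain lists rather than learned encodings; and c is taken to be the image caption.

```python
from typing import Any, Callable, List, Tuple


def run_dialog(image: Any, caption: str, num_rounds: int,
               ask: Callable[[List[Any]], str],
               answer: Callable[[List[Any]], str],
               predict: Callable[[List[Any]], Any]) -> Tuple[List[Any], List[Any]]:
    """Play num_rounds of dialog; return Q-BOT's final state and its predictions y_1..y_T."""
    q_state: List[Any] = [caption]          # s^Q_0 = [c]; Q-BOT never sees the image
    a_state: List[Any] = [image, caption]   # A-BOT additionally sees the image I
    predictions: List[Any] = []
    for _ in range(num_rounds):
        question = ask(q_state)                  # q_t, conditioned on s^Q_{t-1}
        a_state = a_state + [question]           # s^A_t ends with q_t
        reply = answer(a_state)                  # a_t, conditioned on s^A_t
        a_state = a_state + [reply]
        q_state = q_state + [question, reply]    # s^Q_t after the completed exchange
        predictions.append(predict(q_state))     # y_t from the updated state
    return q_state, predictions
```

In the actual model the list concatenations would be replaced by the fact embeddings and history encoders, and a reward would be computed from each prediction.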


Details


Joint Training with Policy Gradients

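Rewards definition:

$$
r_t\Big(\underbrace{s_t^Q}_{\text{state}},\ \underbrace{(q_t, a_t, y_t)}_{\text{action}}\Big) = \underbrace{\ell\big(y_{t-1}, y^{gt}\big)}_{\text{distance at } t-1} - \underbrace{\ell\big(y_t, y^{gt}\big)}_{\text{distance at } t} \tag{2}
$$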
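The reward in (2) is the per-round reduction in distance to the ground-truth feature, so the rewards over a whole dialog telescope to the total improvement from the first prediction to the last. A small sketch, assuming squared Euclidean distance for ℓ and an initial prediction y_0 made before the first round (both are assumptions; `rewards` and `squared_distance` are illustrative names):

```python
from typing import Callable, List, Sequence


def squared_distance(y: Sequence[float], y_gt: Sequence[float]) -> float:
    """Assumed choice of the distance l(y, y_gt)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_gt))


def rewards(predictions: Sequence[Sequence[float]], y_gt: Sequence[float],
            distance: Callable[[Sequence[float], Sequence[float]], float] = squared_distance
            ) -> List[float]:
    """predictions = [y_0, y_1, ..., y_T]; returns [r_1, ..., r_T] with
    r_t = l(y_{t-1}, y_gt) - l(y_t, y_gt)."""
    dists = [distance(y, y_gt) for y in predictions]
    return [dists[t - 1] - dists[t] for t in range(1, len(dists))]
```

A round that moves the prediction away from the target yields a negative reward, which is what penalizes uninformative exchanges.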

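Objective function:

$$
\max_{\theta_A, \theta_Q, \theta_f} J(\theta_A, \theta_Q, \theta_f) \triangleq \mathbb{E}_{\pi_Q, \pi_A}\left[\sum_{t=1}^{T} r_t\big(s_t^Q, (q_t, a_t, y_t)\big)\right] \tag{3}
$$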

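Policy Gradients:

$$
\nabla_{\theta_Q} J = \mathbb{E}_{\pi_Q, \pi_A}\left[ r_t(\cdot)\, \nabla_{\theta_Q} \log \pi_Q\big(q_t \mid s_{t-1}^Q\big) \right] \tag{4}
$$

$$
\nabla_{\theta_A} J = \mathbb{E}_{\pi_Q, \pi_A}\left[ r_t(\cdot)\, \nabla_{\theta_A} \log \pi_A\big(a_t \mid s_t^A\big) \right]. \tag{5}
$$

The feature regression network (θ_f) receives gradient updates for differentiable ℓ(·, ·).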
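The estimators in (4) and (5) share the REINFORCE form: the reward times the gradient of the log-probability of the sampled action. The following toy sketch shows a single-sample estimate for a categorical policy parameterized directly by its logits. This is a deliberate simplification and an assumption: the real agents emit word sequences from recurrent decoders, and `reinforce_gradient` is a hypothetical name.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits: Sequence[float], action: int, reward: float) -> List[float]:
    """Single-sample estimate of reward * d/d(logits) log pi(action).
    For a softmax policy, the gradient of log pi(action) is one_hot(action) - probs."""
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]
```

A positive reward raises the logit of the sampled action and lowers the others; a negative reward does the opposite. Averaging such estimates over sampled dialogs approximates the expectations in (4) and (5).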


Results of Q-BOT/A-BOT Interactions


Qualitative Retrieval Results
