
Two Causal Principles for Improving Visual Dialog

Jiaxin Qi∗1 Yulei Niu∗2 Jianqiang Huang1,3 Hanwang Zhang1

1Nanyang Technological University, 2Renmin University of China, 3Damo Academy, Alibaba Group

[email protected], [email protected], [email protected], [email protected]

Abstract

This paper unravels the design tricks adopted by us — the champion team MReaL-BDAI — for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to state-of-the-art performance on the leaderboard. Such a major improvement comes solely from our careful inspection of the causality behind the model and data, which reveals that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests that we should remove the direct input of the dialog history to the answer model, otherwise a harmful shortcut bias is introduced; Principle 2 says that there is an unobserved confounder for history, question, and answer, leading to spurious correlations in the training data. In particular, to remove the confounder suggested by Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable to any VisDial model. The code is available at https://github.com/simpleshinobu/visdial-principles.

1. Introduction

Given an image I, a dialog history of past Q&A pairs H = {(Q1, A1), ..., (Qt−1, At−1)}, and the current t-th round question Q, a Visual Dialog (VisDial) agent [9] is expected to provide a good answer A. Our community has always considered VQA [5] and VisDial as sister tasks due to their similar settings: Q&A grounded by I (VQA) and Q&A grounded by (I, H) (VisDial). Indeed, from a technical point of view — just like the VQA models — a typical VisDial model first uses an encoder to represent I, H, and Q as vectors, and then feeds them into a decoder for the answer A.

*Equal contribution

Figure 1. Causal graphs of VisDial models: the baseline graph over {H, I, Q, V, A} versus our graph, which applies Principle 1 (delete H → A) and Principle 2 (add the unobserved node U with links U ← H, U → Q, and U → A). H: dialog history. I: image. Q: question. V: visual knowledge. A: answer. U: user preference. Shaded U denotes the unobserved confounder. See Section 3.2 for detailed definitions.

Thanks to the recent advances in encoder-decoder frameworks in VQA [22, 38] and natural language processing [39], the performance (NDCG [1]) of VisDial in the literature has been significantly improved from the baseline 51.63% [2] to the state-of-the-art 64.47% [11].

However, in this paper, we want to highlight an important fact: VisDial is essentially NOT VQA with history! This fact is so profound that all the common heuristics in the vision-language community — such as multimodal fusion [38, 46] and attention variants [22, 25, 26] — cannot appreciate the difference. Instead, we introduce the use of causal inference [27, 28]: a graphical framework grounded in the cause-effect interpretation of the data, not merely the statistical association among them. Before we delve into the details, we would like to present the main contributions: two causal principles, rooted in the analysis of the difference between VisDial and VQA, which lead to a performance leap — a farewell to the 60%-s and an embrace of the 70%-s — for all the baseline VisDial models1 in the literature [9, 21, 41, 26], promoting them to the state-of-the-art in Visual Dialog Challenge 2019 [2].

Principle 1 (P1): Delete link H → A.

Principle 2 (P2): Add one new (unobserved) node U and three new links: U ← H, U → Q, and U → A.

Figure 1 compares the causal graphs of existing VisDial models and the one obtained by applying the proposed two principles.

1 Only those with code and reproducible results, due to resource limits.


Figure 2. The illustrative motivations of the two causal principles. (a) P1: a typical H → A bias. Without P1, the average length of the baseline's top-ranked answers tracks the average answer length in the dialog history, whereas the baseline with P1 stays close to the ground-truth (GT) average. (b) P2: user preference. The unobserved confounder U (the user's hidden preference, e.g., expecting answers about "he") opens the backdoor paths Q ← H → U → A and Q ← U → A.

Although a formal introduction is given in Section 3.2, for now you can simply understand the nodes as data types and the directed links as modal transformations. For example, V → A and Q → A indicate that the answer A is the effect caused by the visual knowledge V and the question Q, through a transformation, e.g., a multi-modal encoder.

P1 suggests that we should remove the direct input of the dialog history to the answer model. This principle contradicts most of the prevailing VisDial models [9, 15, 41, 26, 43, 16, 11, 32], which are based on the widely accepted intuition: the more features you input, the more effective the model is. That intuition is mostly correct, but only under a careful reading of the data generation process. In fact, the VisDial [9] annotators were not allowed to copy from the previous Q&A, i.e., H ↛ A, but were encouraged to ask consecutive questions including co-referenced pronouns like "it" and "those", i.e., H → Q. The answer A is thus expected to be based only on the question Q and the reasoned visual knowledge V. Therefore, a good model should reason over the context (I, H) with Q, not memorize the bias. However, the direct path H → A contaminates the expected causality. Figure 2(a) shows a rather ridiculous bias observed in all baselines without P1: the top answers are those whose length is closest to the average length of the history answers. We offer more justifications for P1 in Section 4.1.

P2 implies that model training based only on the association between (I, H, Q) and A is spurious. By "spurious", we mean that the effect on A caused by (I, H, Q) — the goal of VisDial — is confounded by an unobserved variable U, because U appears in every undesired causal path (a.k.a. backdoor [28]), i.e., an indirect causal link from the input (I, H, Q) to the output A: Q ← U → A and Q ← H → U → A. We believe that such an unobserved U should be the users, since the VisDial dataset essentially brings humans into the loop. Figure 2(b) illustrates how the user's hidden preference confounds them. Therefore, if training focuses only on the conventional likelihood P(A|I, H, Q), the model will inevitably be biased towards the spurious causality; e.g., it may score the answer "Yes, he is" higher than "Yes" merely because the users prefer to see a "he" appear in the answer, given a history context about "he". It is worth noting that the confounder U is more impactful in VisDial than in VQA, because the former encourages the user to rank similar answers subjectively while the latter is more objective. A plausible explanation might be: VisDial is interactive in nature and a not-quite-correct answer is tolerable in one iteration (i.e., dense prediction), while VQA has only one chance, which demands accuracy (i.e., one-hot prediction).

By applying P1 and P2 to the baseline causal graph, we obtain the proposed one (the right one in Figure 1), which serves as a model-agnostic roadmap for the causal inference of VisDial. To remove the spurious effect caused by U, we use the do-calculus [28] P(A|do(I, H, Q)), which is fundamentally different from the conventional likelihood P(A|I, H, Q): the former is an active intervention, which cuts off U → Q and H → Q and samples (hence the name "calculus") every possible U|H, seeking the true effect on A caused only by (I, H, Q); the latter is a passive observation that is affected by the existence of U. The formal introduction and details are given in Section 4.3. In particular, given the fact that once the dataset is ready U is no longer observed, we propose a series of effective approximations in Section 5.

We validate the effectiveness of P1 and P2 on the most recent VisDial v1.0 dataset. We show significant boosts (absolute NDCG) by applying them to 4 representative baseline models: LF [9] (↑16.42%), HCIAE [21] (↑15.01%), CoAtt [41] (↑15.41%), and RvA [26] (↑16.14%). Impressively, on the official test-std server, we use an ensemble of the simplest baseline LF [9] to beat our 2019 winning performance by 0.2%, a more complex ensemble to beat it by 0.9%, and we lead all the single-model baselines to state-of-the-art performance.

2. Related Work

Visual Dialog. Visual Dialog [9, 10] is more interactive and challenging than most vision-language tasks, e.g., image captioning [45, 44, 4] and VQA [5, 38, 37, 36]. Specifically, Das et al. [9] collected a large-scale free-form visual dialog dataset, VisDial [7]. They applied a novel protocol: during the live chat, the questioner cannot see the picture and asks open-ended questions, while the answerer gives free-form answers. Another dataset, GuessWhat?! [10], is a goal-driven visual dialog: the questioner should locate an unknown object in a rich image scene by asking a sequence of closed-ended "yes/no" questions.


We apply the first setting in this paper; thus, the key difference is that users played an important role in the data collection process.

All of the existing approaches to the VisDial task are based on the typical encoder-decoder framework [15, 12, 33, 11, 32, 47]. They can be categorized by their usage of history. 1) Holistic: they treat history as a whole to feed into models, e.g., HACAN [43], DAN [16] and CorefNMN [18]. 2) Hierarchical: they use a hierarchical structure to deal with history, e.g., HRE [9]. 3) Recursive: RvA [26] uses a recursive method to process history. However, they all overlook the fact that the history information should not be directly fed to the answer model (i.e., our proposed Principle 1). The baselines used in this paper are LF [9], the earliest model; HCIAE [21], the first model to use hierarchical history attention; CoAtt [41], the first to use a co-attention mechanism; and RvA [26], the first to use a tree-structured attention mechanism.
Causal Inference. Recently, some works [24, 6, 23, 34, 40] introduced causal inference into machine learning, trying to endow models with the ability to pursue cause-effect relations. In particular, we use Pearl's structural causal model (SCM) [28] to hypothesize the data generation process, which is a model-agnostic framework that reflects the nature of the data.

3. Visual Dialog in Causal Graph

In this section, we formally introduce the visual dialog task and describe how the popular encoder-decoder framework follows the baseline causal graph shown in Figure 1. More details of causal graphs can be found in [28, 29].

3.1. Visual Dialog Settings

Settings. According to the definition of the VisDial task by Das et al. [9], at each round t we are given an input image I, the current question Qt, the dialog history H = {C, (Q1, A1), ..., (Qt−1, At−1)}, where C is the image caption and (Qi, Ai) is the i-th round Q&A pair, and a list of 100 candidate answers At = {At^(1), ..., At^(100)}. A VisDial model is evaluated by ranking the candidate answers At.
Evaluation. Recently, the ranking metric Normalized Discounted Cumulative Gain (NDCG) has been adopted by the VisDial community [1]. It differs from the classification metric (e.g., top-1 accuracy) used in VQA, and is more compatible with the relevance scores of the answer candidates in VisDial rated by humans. NDCG requires relevant candidates to be ranked in higher places, rather than just selecting the ground-truth answer.
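To make the metric concrete, below is a minimal NumPy sketch of NDCG over a candidate list; it is our own illustration rather than the official evaluation code, and we assume (as we understand the protocol) that k is set to the number of candidates with nonzero human relevance.

```python
import numpy as np

def ndcg_at_k(relevance, ranking, k):
    """Compute NDCG@k for one dialog round.

    relevance: array of human-rated relevance scores, one per candidate.
    ranking:   candidate indices sorted by the model, best first.
    k:         number of top-ranked candidates to evaluate.
    """
    # Discounted cumulative gain of the predicted ranking.
    gains = relevance[ranking[:k]]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(gains * discounts)
    # Ideal DCG: the same discounts applied to the best possible ordering.
    ideal = np.sum(np.sort(relevance)[::-1][:k] * discounts)
    return dcg / ideal if ideal > 0 else 0.0

# Toy example with 5 candidates: the model ranks a relevant paraphrase high.
relevance = np.array([1.0, 0.8, 0.0, 0.6, 0.0])
ranking = np.array([1, 0, 3, 2, 4])   # model's sorted candidate indices
print(ndcg_at_k(relevance, ranking, k=3))
```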

3.2. Encoder-Decoder as Causal Graph

We first give the definition of a causal graph, then revisit the encoder-decoder framework of existing methods using the elements of the baseline graph in Figure 1.

Causal Graph. A causal graph [28], as shown in Figure 1, describes how variables interact with each other, expressed by a directed acyclic graph G = {N, E} consisting of nodes N and directed edges E (i.e., arrows). Nodes denote variables, and arrows denote the causal relationships between two nodes: A → B denotes that A is the cause and B is the effect, meaning the outcome of B is caused by A. A causal graph is a highly general roadmap specifying the causal dependencies among variables.

As we will discuss in the following part, all of the existing methods can be revisited in the view of the baseline graph shown in Figure 1.
Feature Representation and Attention in Encoder. The visual feature is denoted as node I in the baseline graph, which is usually a fixed feature extracted by Faster R-CNN [31] with a ResNet backbone [13] pre-trained on Visual Genome [19]. For the language features, the encoder first embeds each sentence into word vectors and then passes them through an RNN [14, 8] to generate the question and history features, denoted as {Q, H}.

Most existing methods apply an attention mechanism [42] in the encoder-decoder to explore the latent weights of a set of features. A basic attention operation can be represented as x = Att(X, K), where X is the set of features to attend over, K is the key (i.e., guidance), and x is the attended feature of X. Details can be found in most visual dialog methods [21, 41, 43]. In the baseline graph, the sub-graph {I → V, Q → V, H → Q → V} denotes a series of attention operations for the visual knowledge V. Note that the implementations of the arrows are not necessarily independent, such as co-attention [41], and the process can be written as Input: {I, Q, H} ⇒ Output: {V}, where possible intermediate variables can be added as mediator nodes into the original arrows. Without loss of generality, these mediators do not affect the causalities in the graph.
Response Generation in Decoder. After obtaining the features from the encoder, existing methods fuse them and feed the fused representation into a decoder to generate an answer. In the baseline graph, node A denotes the answer model that decodes the fused features from {H → A, Q → A, V → A} and transforms them into an answer sentence. In particular, the decoder can be generative, i.e., generating an answer sentence with an RNN, or discriminative, i.e., selecting an answer sentence using candidate answer classifiers.
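As a concrete reference for the abstraction x = Att(X, K), here is a minimal dot-product attention sketch in PyTorch; the module name, projection layers, and dimensions are our own illustrative choices, not any specific baseline's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Att(nn.Module):
    """x = Att(X, K): attend over a feature set X guided by a key K."""
    def __init__(self, feat_dim, key_dim, hid_dim=512):
        super().__init__()
        self.proj_x = nn.Linear(feat_dim, hid_dim)
        self.proj_k = nn.Linear(key_dim, hid_dim)

    def forward(self, X, K):
        # X: (batch, n, feat_dim) set of features, e.g. region features of I
        # K: (batch, key_dim) guidance, e.g. the question embedding Q
        scores = (self.proj_x(X) * self.proj_k(K).unsqueeze(1)).sum(-1)  # (batch, n)
        alpha = F.softmax(scores, dim=-1)                                # attention weights
        return (alpha.unsqueeze(-1) * X).sum(1)                          # attended feature x

# Example: attend over 36 region features guided by a question vector.
att = Att(feat_dim=2048, key_dim=512)
x = att(torch.randn(2, 36, 2048), torch.randn(2, 512))   # x: (2, 2048)
```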

4. Two Causal Principles

4.1. Principle 1

When should we draw an arrow from one node pointing to another? According to the definition in Section 3.2, the criterion is whether one node is the cause and the other is the effect. To build intuition, let us understand P1 by discussing the rationale behind the "double-blind" review policy.


Given three variables, "Well-known Researcher" (R), "High-quality Paper" (P), and "Accept" (A), our community common sense tells us that R → P, since top researchers usually lead high-quality research, and P → A is even more obvious. Therefore, for the good of the community, double-blind review prohibits the direct link R → A through author anonymity; otherwise, biases such as personal emotions and politics from R may affect the outcome of A.

The story is similar in VisDial. Without loss of generality, we only analyze the path H → Q → A. If we inspect the role of H, we find that it helps Q resolve co-references like "it" and "their". As a result, Q listens to H. Then, we use Q to obtain A. Here, Q becomes a mediator that cuts off the direct association between H and A, which makes P(A|Q, H) = P(A|Q), like the "High-quality Paper" in the previous story. However, if we draw an arrow from H to A, i.e., H → A, the undesirable bias of H will be learned for the prediction of A, which hampers the natural process of VisDial, such as the interesting bias illustrated in Figure 2(a). Another example is discussed in Figure 4: with the direct link H → A, A prefers to match words in H even though they are literally irrelevant to Q. After we apply P1, these phenomena are relieved, as shown by the blue line in Figure 2(a), which is closer to the NDCG ground-truth average answer length, denoted by the green dashed line. Please refer to other qualitative studies in Section 6.5.
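The mediation argument can also be checked numerically. The toy simulation below (our own illustration, not from the paper) generates data from H → Q → A with no direct H → A edge and verifies that conditioning on the mediator Q screens A off from H, i.e., P(A|Q, H) = P(A|Q).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# H -> Q -> A, with no direct H -> A edge.
h = rng.random(n) < 0.5                       # e.g., history already mentions the object
q = rng.random(n) < np.where(h, 0.9, 0.3)     # Q listens to H (co-reference resolution)
a = rng.random(n) < np.where(q, 0.7, 0.2)     # A listens to Q only

# Conditioning on the mediator Q screens A off from H: P(A|Q,H) = P(A|Q).
print("P(A=1 | Q=1, H=1) =", a[q & h].mean())   # ~0.7
print("P(A=1 | Q=1, H=0) =", a[q & ~h].mean())  # ~0.7
print("P(A=1 | Q=1)      =", a[q].mean())       # ~0.7
```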

4.2. Principle 2

Before discussing P2, we first introduce an important concept in causal inference [28]. In a causal graph, the fork-like pattern in Figure 3(a) contains a confounder U, which is the common cause of Q and A (i.e., Q ← U → A). The confounder U opens a backdoor path starting from Q, making Q and A spuriously correlated even if there is no direct causality between them.

In the data generation process of VisDial, not only can both the questioner and the answerer see the dialog history, but the answer annotators can also look at the history when annotating the answers. Their preference after seeing the history can be understood as a part of human nature, or as subtleties conditional on a dialog context, and thus it has a causal effect on both Q and A. Moreover, since this preference is nuanced and uncontrollable, we consider it an unobserved confounder for Q and A.

It is worth noting that the confounder hinders us from finding the true causal effect. Take the graph in Figure 3(b) as an example. The causal effect from Q to A is 0; however, P(A|Q) − P(A) is nonzero because Q and A are both influenced by U and thus are correlated (thanks to Reichenbach's common cause principle [28]). That is, if we are given Q, a change in the likelihood of A will be observed compared to the case where nothing is given.

Figure 3. Examples of the confounder and the do-operator, and sketch causal graphs of our three attempts to de-confound: (a) confounder U; (b) spurious relation; (c) do-operator; (d) question type; (e) score sampling; (f) hidden dictionary.

Therefore, if we take P(A|Q) as our VisDial model, it will still predict nonsense answers even when Q has nothing to do with A. As illustrated in Figure 2(b), the model will prefer the candidates about "he" even if Q is not given, which means it captures the confounder U rather than the true rationale between Q and A. Next, we introduce a powerful technique that makes the Q and A in Figure 3(b) "independent", i.e., exhibiting no causal relation.
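The following toy simulation (our own illustration) mimics Figure 3(b): a binary user preference U influences both Q and A, while Q has no causal effect on A; the observed conditional P(A|Q) nevertheless deviates from P(A), i.e., the correlation is purely spurious.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Unobserved confounder U: does the user focus on "he" in this context?
u = rng.random(n) < 0.5
# Q and A each depend on U only; Q has no causal effect on A.
q = rng.random(n) < np.where(u, 0.8, 0.2)   # P(Q=1|U=1)=0.8, P(Q=1|U=0)=0.2
a = rng.random(n) < np.where(u, 0.9, 0.1)   # P(A=1|U=1)=0.9, P(A=1|U=0)=0.1

print("P(A=1)       =", a.mean())           # ~0.5
print("P(A=1 | Q=1) =", a[q].mean())        # ~0.74: spuriously high
```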

4.3. do-calculus.

The technique is do-calculus, introduced in [28, 29]. Specifically, do(Q = q) denotes that we deliberately assign a value q to variable Q (i.e., intervention), rather than passively observe Q = q. As illustrated in Figure 3(c), do(Q = q) can be understood as cutting all the original incoming arrows to Q, which makes Q and U independent. Therefore, we have the well-known backdoor adjustment [28]:

P(A|do(Q = q)) = ∑_u P(A|Q = q, u) P(u).

Note that this is different from the Bayes rule

P(A|Q = q) = ∑_u P(A|Q = q, u) P(u|Q = q),

thanks to the independence P(u|Q = q) = P(u) introduced by do-calculus. Let us revisit Figure 3(b) using do-calculus. We find that P(A|do(Q = q)) − P(A) = 0, that is, any intervention on Q will not influence the probability of A, which reflects the correct relation between Q and A: no causal relation. Therefore, P(A|do(Q = q)) should be the objective of the answer model in VisDial.
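Using the same toy probabilities as in the simulation above, the sketch below (again our own illustration) applies the backdoor adjustment analytically: averaging P(A|Q, u) with the prior P(u) recovers P(A|do(Q = 1)) = P(A) = 0.5, i.e., no causal effect, whereas naive conditioning with P(u|Q) stays biased at 0.74.

```python
import numpy as np

# Ground-truth mechanisms of the toy graph Q <- U -> A (Q has no effect on A).
p_u = np.array([0.5, 0.5])            # P(U=0), P(U=1)
p_a_given_u = np.array([0.1, 0.9])    # P(A=1|U=u); independent of Q

# Backdoor adjustment: P(A=1|do(Q=1)) = sum_u P(A=1|Q=1,u) P(u)
p_do = np.sum(p_a_given_u * p_u)                      # = 0.5 = P(A=1)

# Naive conditioning: P(A=1|Q=1) = sum_u P(A=1|Q=1,u) P(u|Q=1)
p_q_given_u = np.array([0.2, 0.8])                    # P(Q=1|U=u)
p_u_given_q = p_q_given_u * p_u / np.sum(p_q_given_u * p_u)
p_cond = np.sum(p_a_given_u * p_u_given_q)            # = 0.74, spurious

print(p_do, p_cond)
```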

For our proposed VisDial graph shown in Figure 1, we can use the intervention do(Q, H, I) and the backdoor adjustment to obtain our overall model. Here, we slightly abuse the notation do(Q, H, I) for do(Q = q, H = h, I = i):

P(A|do(Q, H, I)) = ∑_u P(A|do(Q, H, I), u) P(u|do(Q, H, I))
                 = ∑_u P(A|do(Q), H, I, u) P(u|H)
                 = ∑_u P(A|Q, H, I, u) P(u|H).                (1)


The detailed derivation and proof can be found in the supplementary materials.

So far, we have provided all the ingredients: the baseline causal graph, the two proposed principles, and their theoretical solution, do-calculus. Next, we introduce some implementations of the proposed solution in Eq. (1).

5. Improved Visual Dialog Models

It is trivial to implement P1, and we will provide its training details in Section 6.3. For P2, since U is unobserved, it is impossible to directly sample u in Eq. (1). Therefore, our technical contribution is to introduce three approximations. For notational simplicity, we first re-write Eq. (1) as:

P(A|do(Q, H, I)) = ∑_u P_u(A) P(u|H),                (2)

where P_u(A) := P(A|Q, H, I, u).

5.1. Question Type

Since we cannot directly sample u from the unobserved confounder, we use the i-th answer candidate a_i as a delegate for a sample u. That is because a_i is a sentence observed from the "mind" of user u during dataset collection. Then, ∑_u P_u(A) P(u|H) can be approximated as ∑_i P_i(A) P(a_i|H). We further use P(a_i|QT) to approximate P(a_i|H) for two reasons. First, P(a_i|H) essentially describes prior knowledge about a_i without comprehending the whole {Q, H, I} triplet. A similar scenario is that if we know the QT (question type), e.g., "what color", the answer candidates denoting colors have higher probabilities even without comprehending the question details. Second, QT is extracted from the question Q, which is a descendant of the history H in our graph, indicating that QT partially reveals H [28]. In practice, we manually define a set of question types, each of which has a certain answer frequency. For each dialog round, a normalized score s_i^qt := P(a_i|QT) (i.e., ∑_i s_i^qt = 1) of each candidate a_i is calculated according to the frequency of a_i under question type qt. More details are given in Section 6.3. Finally, we have the approximation for Eq. (2):

∑_u P_u(A) P(u|H) ≈ ∑_i P_i(A) · s_i^qt,                (3)

where P_i(A) = softmax(f_s(e_i, m)), f_s is a similarity function, e_i is the embedding of candidate a_i, and m is the joint embedding of {Q, I, H}; the sketch graph is shown in Figure 3(d). Since the question type is observed from Q, the approximation P(a_i|QT) slightly undermines the prior assumption of the backdoor adjustment in Eq. (1) (i.e., the prior P(u|H) cannot be conditional on Q). Fortunately, QT covers only a small part of Q (i.e., the first few words), so the approximation is reasonable.
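A minimal sketch of how the question-type prior s_i^qt can enter training, assuming a pre-counted answer-frequency table per question type; the function and variable names are hypothetical. For brevity we express Eq. (3) as a weighted log-likelihood here, whereas Section 6.3 trains the question-type prior with the R2 loss.

```python
import torch
import torch.nn.functional as F

def question_type_loss(logits, qt_ids, qt_answer_freq):
    """logits: (batch, 100) candidate scores f_s(e_i, m).
    qt_ids: (batch,) index of the detected question type (e.g., "what color").
    qt_answer_freq: (num_types, 100) answer frequencies per question type.
    """
    # s_i^qt: normalized prior over candidates given the question type.
    s_qt = qt_answer_freq[qt_ids]
    s_qt = s_qt / s_qt.sum(dim=1, keepdim=True).clamp(min=1e-8)
    # Eq. (3): sum_i P_i(A) * s_i^qt, trained as a weighted log-likelihood.
    log_p = F.log_softmax(logits, dim=1)      # log P_i(A)
    return -(s_qt * log_p).sum(dim=1).mean()

# Toy usage: 2 dialog rounds, 100 candidates, 55 manually defined question types.
loss = question_type_loss(torch.randn(2, 100),
                          torch.tensor([3, 17]),
                          torch.rand(55, 100))
```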

5.2. Answer Score Sampling

Since the question type implementation slightly undermines the backdoor adjustment, we introduce a better approximation that directly samples from u: Answer Score Sampling. This implementation is also widely known in the community as our previously proposed dense fine-tuning [3].

We still use a_i to approximate u, and we use the (normalized) ground-truth NDCG relevance score s_i annotated by humans to approximate P(a_i|H). Note that s_i directly reveals the human preference for a_i in the context H (i.e., the prior P(a_i|H)). In practice, we sample s_i from the subset of the training set with dense annotations. Therefore, we have:

∑_u P_u(A) P(u|H) ≈ ∑_i P_i(A) · s_i,                (4)

and the sketch graph is illustrated in Figure 3(e). In practice, Eq. (4) can be implemented using different loss functions. Here we give three examples:
Weighted Softmax Loss (R1). We extend the log-softmax loss to a weighted form, where P_i(A) is log(softmax(p_i)), p_i denotes the logit of candidate a_i, and s_i is the corresponding normalized relevance score.
Binary Sigmoid Loss (R2). This loss is close to the binary cross-entropy loss, where P_i(A) represents log(sigmoid(p_i)) or log(sigmoid(1 − p_i)), and s_i represents the corresponding normalized relevance score.
Generalized Ranking Loss (R3). Note that the answer generation process can be viewed as a ranking problem. Therefore, we derive a ranking loss in which P_i(A) is

log [ exp(p_i) / ( exp(p_i) + ∑_{j∈G} exp(p_j) ) ],

where G is the group of candidates with lower relevance scores than candidate a_i, and s_i is a normalized characteristic score (i.e., it equals 0 for a_i with relevance score 0 and 1 for a_i with positive relevance score).

More details of the three loss functions are given in the supplementary materials. It is worth noting that our losses are derived from the underlying causal principle P2 in Eq. (4), not from the purpose of regressing to the ground-truth NDCG. The comparison is given in Section 6.4.
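A hedged sketch of the three losses, assuming per-round candidate logits p and dense relevance scores s; the exact term combination for R2 and the reductions below are our reading of the descriptions above, and the released code may differ.

```python
import torch
import torch.nn.functional as F

def weighted_softmax_loss(p, s):            # R1
    # sum_i s_i * log softmax(p)_i, with s normalized to sum to 1 per round.
    s = s / s.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(s * F.log_softmax(p, dim=1)).sum(dim=1).mean()

def binary_sigmoid_loss(p, s):              # R2
    # Per-candidate binary terms weighted by the normalized relevance score
    # (one reading of "log sigmoid(p_i) or log sigmoid(1 - p_i)").
    s = s / s.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(s * torch.log(torch.sigmoid(p) + 1e-8)
             + (1 - s) * torch.log(torch.sigmoid(1 - p) + 1e-8)).sum(dim=1).mean()

def generalized_ranking_loss(p, s):         # R3
    # For each candidate a_i with positive relevance, push its logit above the
    # group G of candidates with strictly lower relevance:
    # -log exp(p_i) / (exp(p_i) + sum_{j in G} exp(p_j)).
    losses = []
    for b in range(p.size(0)):
        for i in torch.nonzero(s[b] > 0).flatten():
            group = s[b] < s[b, i]                       # candidates ranked below a_i
            denom = p[b, i].exp() + p[b, group].exp().sum()
            losses.append(-(p[b, i] - denom.log()))
    return torch.stack(losses).mean()

# Toy usage: 2 rounds, 10 candidates with dense relevance scores in [0, 1].
p, s = torch.randn(2, 10), torch.rand(2, 10)
print(weighted_softmax_loss(p, s), binary_sigmoid_loss(p, s), generalized_ranking_loss(p, s))
```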

5.3. Hidden Dictionary Learning

The aforementioned two implementations are discrete, since they sample specific a_i to approximate u. For a better approximation, we propose learning to approximate the unobserved confounder U. As shown in Figure 3(f), we design a dictionary to model U. In practice, the dictionary is an N × d matrix D_u, where N is set manually and d is the hidden feature dimension. Note that given a sample u and an answer candidate a_c, Eq. (2) can be implemented as ∑_u P_u(a_c) P(u|H). Since the last layer of our network for answer prediction is a softmax layer, P_u(a_c) = softmax(f_s(e_c, u, m)), where e_c is the embedding of candidate a_c, u is sampled from D_u, m is the joint embedding of {Q, I, H}, and f_s is a similarity function, Eq. (2) can be re-written as:

P(A|do(Q, H, I)) := E_{u|H}[softmax(f_s(e_c, u, m))].                (5)

Since Eq. (5) requires expensive sampling over u, we use the NWGM approximation [42, 35] to move the expectation inside the softmax:

E_{u|H}[softmax(f_s(e_c, u, m))] ≈ softmax(E_{u|H}[f_s(e_c, u, m)]).                (6)

The details of the NWGM approximation can be found in the supplementary materials. In this paper, we model f_s(e_c, u, m) = e_c^T (u + m). Thanks to the linear additive property of expectation, we can use e_c^T (E_{u|H}[D_u] + m) to calculate E_{u|H}[e_c^T (u + m)]. In practice, we use a dot-product attention to compute E_{u|H}[D_u]. Specifically, E_{u|H}[D_u] = softmax(L^T K) ⊙ D_u, where L = W_1 h, K = W_2 D_u, ⊙ is the element-wise product, h is the embedding of the history H, and W_1, W_2 are mapping matrices. The training details can be found in Section 6.3.
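A minimal sketch of the hidden dictionary head under the NWGM approximation, assuming a learned N × d dictionary D_u, a history embedding h, the joint embedding m, and candidate embeddings e_c; layer names and sizes are illustrative. We realize E_{u|H}[D_u] as an attention-weighted sum over the dictionary rows, which is one reading of the element-wise form stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenDictionaryHead(nn.Module):
    def __init__(self, n_entries=100, dim=512):
        super().__init__()
        self.D_u = nn.Parameter(torch.randn(n_entries, dim) * 0.01)  # dictionary for U
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)

    def forward(self, h, m, e):
        # h: (batch, dim) history embedding; m: (batch, dim) joint {Q, I, H} embedding
        # e: (batch, 100, dim) candidate answer embeddings
        L = self.W1(h)                                   # (batch, dim)
        K = self.W2(self.D_u)                            # (N, dim)
        alpha = F.softmax(L @ K.t(), dim=-1)             # (batch, N) attention over D_u
        eu = alpha @ self.D_u                            # E_{u|H}[D_u]: (batch, dim)
        # Eq. (5)/(6): softmax over e_c^T (E_{u|H}[D_u] + m) for the 100 candidates.
        logits = torch.bmm(e, (eu + m).unsqueeze(-1)).squeeze(-1)   # (batch, 100)
        return F.log_softmax(logits, dim=-1)

head = HiddenDictionaryHead()
scores = head(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 100, 512))
```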

6. Experiments

6.1. Experimental Setup

Dataset. Our proposed principles are evaluated on the recently released real-world dataset VisDial v1.0. Specifically, the training set of VisDial v1.0 contains 123K images from the COCO dataset [20] with a 10-round dialog for each image, resulting in 1.2M dialog rounds in total. The validation and test sets were collected from Flickr, with 2K and 8K COCO-like images respectively. The test set is further split into test-std and test-challenge splits, each with 4K images, hosted on the blind online evaluation server. Each dialog in the training and validation sets has 10 rounds, while the number of rounds in the test set is uniformly distributed from 1 to 10. For each dialog, a list of 100 answer candidates is given for evaluation. In the following, the results are reported on the validation and test-std sets.
Metrics. As mentioned in Section 3.1, NDCG is recommended by the organizers and accepted by the community. There are other retrieval-based metrics such as MRR (Mean Reciprocal Rank), where the ground-truth answer is provided by a single user. Note that this single answer may easily be influenced by that user's preference (e.g., answer length). We argue that this may be the reason why models with the history shortcut achieve higher MRR (e.g., due to the bias illustrated in Figure 2) but lower NDCG. Therefore, retrieval-based metrics are not consistent with NDCG. For these reasons, and due to space limitations, we only show the results on NDCG in the main paper. For completeness, a further discussion of NDCG versus other retrieval-based metrics, as well as the performance on all metrics, is given in the supplementary materials.

6.2. Model Zoo

We report the performance of the following base models, including LF [9], HCIAE [21], CoAtt [41] and RvA [26]:
LF [9]. This naive base model has no attention module. We expand the model by adding some basic attention operations, including question-based history attention and question-history-based visual attention refinement.
HCIAE [21]. The model consists of question-based history attention and question-history-based visual attention.
CoAtt [41]. The model consists of question-based visual attention, image-question-based history attention, image-history-based question attention, and a final question-history-based visual attention.
RvA [26]. The model consists of question-based visual attention and history-based visual attention refinement.

6.3. Implementation Details

Pre-processing. For language pre-processing, we followed the procedure introduced by [9]. First, we lowercased all letters, converted digits to words, and removed contractions. We then used the Python NLTK toolkit to tokenize sentences into word lists, followed by padding or truncating captions, questions, and answers to lengths of 40, 20 and 20, respectively. Next, we built a vocabulary of 11,322 tokens, including 11,319 words that occur at least 5 times in train v1.0 and 3 instruction tokens. We loaded pre-trained word embeddings from GloVe [30] to initialize all word embeddings, which were shared between the encoder and decoder, and we applied 2-layer LSTMs to encode the word embeddings, with a hidden state dimension of 512. For the visual features, we used the bottom-up attention features [4] provided by the official release [1].
Implementation of Principles. For Principle 1 (P1), we eliminated the history feature from the final fused vector representation for all models, while keeping the other parts unchanged. For HCIAE [21] and CoAtt [41], we also blocked the history guidance to the image. For Principle 2 (P2), we trained the models using the preference score, which is either counted from question types or given by the official release (i.e., the dense annotations in the VisDial v1.0 training set). Specifically, for "question type", we first defined 55 types and marked answers occurring more than 5 times as preferred answers, then used this preference to train our model with the (R2) loss proposed in Section 5.2. "Answer score sampling" was used directly to fine-tune our pre-trained model with the proposed loss functions. For the "hidden dictionary", we realized D_u as a matrix with N = 100 and d = 512, initialized it with the features of the top-100 popular answers, and then trained it on the dense annotations with the R3 loss. More details can be found in the supplementary materials. Note that the implementations following P1 and P2 are flexible.
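To illustrate what implementing P1 amounts to in the encoder, here is a hedged sketch of the fusion step: the history embedding may still guide attention upstream (H → Q → V), but it is excluded from the vector fed to the answer decoder (no H → A). The module and variable names are our own, not the released implementation.

```python
import torch
import torch.nn as nn

class FusionWithP1(nn.Module):
    """Fuse features for the answer decoder without a direct H -> A input."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # only [Q; V]; history is dropped

    def forward(self, q, v, h=None):
        # q: question embedding, v: attended visual knowledge,
        # h: history embedding (used upstream for attention only, ignored here).
        return torch.tanh(self.fuse(torch.cat([q, v], dim=-1)))

fused = FusionWithP1()(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```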


Model   | baseline | QT    | S: R0 | S: R1 | S: R2 | S: R3 | D
LF [9]  | 57.21    | 58.97 | 67.82 | 71.27 | 72.04 | 72.36 | 72.65
LF +P1  | 61.88    | 62.87 | 69.47 | 72.16 | 72.85 | 73.42 | 73.63

Table 1. Performance (NDCG%) comparison for the experiments of applying our principles on the validation set of VisDial v1.0. LF is the enhanced version as we mentioned. QT, S and D denote question type, answer score sampling, and hidden dictionary learning, respectively. R0, R1, R2, R3 denote regressive loss, weighted softmax loss, binary sigmoid loss, and generalized ranking loss, respectively.

Training. We used the softmax cross-entropy loss to train the model with P1, and used Adam [17] with a learning rate of 4 × 10−3, decayed at epochs 5, 7, and 9 with a decay rate of 0.4. The model was trained for 15 epochs in total. In addition, Dropout [35] was applied with a ratio of 0.4 for the RNNs and 0.25 for the fully connected layers. Other settings were left at their defaults.
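A hedged sketch of the optimizer schedule described above (Adam at 4 × 10−3, decayed by 0.4 at epochs 5, 7, and 9, for 15 epochs); the wiring in the released code may differ, and the model parameters here are stand-ins.

```python
import torch

params = [torch.nn.Parameter(torch.randn(512, 512))]       # stand-in model parameters
optimizer = torch.optim.Adam(params, lr=4e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 7, 9], gamma=0.4)

for epoch in range(15):
    # ... one training epoch over VisDial v1.0 would run here ...
    optimizer.step()        # placeholder for the per-batch updates
    scheduler.step()        # decay the learning rate after epochs 5, 7, 9
```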

6.4. Quantitative Results

Table 1 shows the results of the different implementations of P2, i.e., question type, answer score sampling, and hidden dictionary learning. Overall, all of the implementations improve the performance of the base models. Specifically, the implementations of P2 can further boost performance by up to 11.75% via hidden dictionary learning. Moreover, our designed loss functions based on Eq. (2) outperform the regressive score, which is implemented as a Euclidean distance loss and denoted as R0. The reason is that the regression fine-tuning strategy is not a proper approximation for P2. We also find that the proposed ranking loss (i.e., R3) performs best, since it matches the ranking nature of VisDial.

Note that our principles are model-agnostic. Table 2 shows the results of applying them to four different models (i.e., LF [9], HCIAE [21], CoAtt [41] and RvA [26]). In general, both principles improve all the models in every ablative condition (i.e., P1, P2, P1+P2). Note that the effects of P1 and P2 are additive, i.e., combining P1 and P2 performs best.

We finally used the blind online test server to verify the effectiveness of our principles on the test-std split of VisDial v1.0. As shown in Table 3, the top part contains the results of the baseline models implemented with our principles, while the bottom part reproduces the recent Visual Dialog Challenge 2019 leaderboard [2]. We used an ensemble of the enhanced LF [9] to beat the Visual Dialog Challenge 2019 winner (i.e., MReaL-BDAI), which can also be regarded as an implementation of P1 and P2. Promisingly, by applying our principles, we promote all the baseline models to the top ranks of the leaderboard.

Model    | LF [9] | HCIAE [21] | CoAtt [41] | RvA [26]
baseline | 57.21  | 56.98      | 56.46      | 56.74
+P1      | 61.88  | 60.12      | 60.27      | 61.02
+P2      | 72.65  | 71.50      | 71.41      | 71.44
+P1+P2   | 73.63  | 71.99      | 71.87      | 72.88

Table 2. Performance (NDCG%) of ablative studies on different models on the VisDial v1.0 validation set. P2 indicates the most effective implementation (i.e., hidden dictionary learning) shown in Table 1. Note that applying P2 alone uses the implementations in Section 5 with the history shortcut kept.

Figure 4. Qualitative results of the baseline and the baseline with P1 on the validation set of VisDial v1.0. The numbers in brackets after each ranked answer denote relevance scores. The baseline's top-ranked candidates are influenced by shortcuts from the dialog history, e.g., word matching with "sink" (NDCG 0.44 → 0.66 with P1) and copying "yes" from the history when the ground-truth answer is "no" (NDCG 0.57 → 0.78 with P1); the baseline with P1 does not make such biased shortcut choices. More details can be found in Section 6.5.

6.5. Qualitative Analysis

The qualitative results illustrated in Figure 4 and Figure 5 show the following advantages of our principles.
History Bias Elimination. After applying P1, the harmful patterns learned from the history are relieved, such as the answer-length bias shown in Figure 2(a). The example at the top of Figure 4 shows the word-match bias of the baseline. In this example, the word "sink" from the history is literally unrelated to the current question; however, for the baseline model, some undesirable candidates (i.e., with low relevance scores) containing the word "sink" appear among the top-ranked answers due to the wrong direct history shortcut. To further confirm this conjecture, we counted the word-match cases of object words (e.g., "sink" and "counter") between the history and the top-10 candidates of the ranked lists on the validation set. The statistics indicate that P1 decreases word matching from the history by about 10% (from ~5200 occurrences for the baseline to ~4800 with P1). The bottom example shows that, when the answer "yes" exists in the history, the baseline model tends to rank "yes" in a high place. However, in some cases this is the opposite of the real answer "no", which leads to a lower NDCG.


Figure 5. Qualitative examples of the ranked candidates of the baseline and the baseline with P2, together with some key rank changes that boost NDCG (e.g., "nope" (0.8): 25 → 2, "not that i can see" (0.8): 54 → 4, "no, i can't" (1.0): 10 → 3, NDCG 0.38 → 0.96; "yes he is" (1.0): 6 → 2, "yes, he is" (1.0): 8 → 3, "yes," (1.0): 9 → 5, NDCG 0.46 → 0.92). These examples are taken from the validation set of VisDial v1.0.

Model                              | NDCG(%)
Ours:
  P1+P2 (More Ensemble)            | 74.91
  LF+P1+P2 (Ensemble)              | 74.19
  LF+P1+P2 (single)                | 71.60
  RvA+P1+P2 (single)               | 71.28
  CoAtt+P1+P2 (single)             | 69.81
  HCIAE+P1+P2 (single)             | 69.66
Leaderboard:
  VD-BERT (Ensemble)*              | 75.13
  Tohuku-CV Lab (Ensemble)*        | 74.88
  MReaL-BDAI*                      | 74.02
  SFCU (Single)*                   | 72.80
  FancyTalk (HeteroFM)*            | 72.33
  Tohuku-CV Lab (Ensemble w/o ft)* | 66.53

Table 3. Our results and comparisons to the recent Visual Dialog Challenge 2019 leaderboard results on the test-std set of VisDial v1.0. Results are reported by the test server; (*) denotes an entry taken from [2]. Note that the top five models on the leaderboard use the dense fine-tuning implementation described in Section 5.2.

After applying P1, this problem can be effectively alleviated. To verify this conclusion, we further calculated the average rank of "yes" for the baseline and the baseline with P1 in such cases (i.e., "yes" appears in the history and the real answer is "no"). The average ranks are 4.82 for the baseline and 6.63 for the baseline with P1, respectively. The lower placement of "yes" means that P1 relieves the "yes" shortcut from the history. More examples of these biases can be found in the supplementary materials.
More Reasonable Ranking. Figure 5 shows that the baseline model focuses only on ground-truth answers like "no" or "yes" and does not care about the ranks of other answers with similar meanings like "nope" or "yes, he is". This does not match human intuition, because candidates with similar semantics are still reasonable, and it also leads the baseline model to a lower NDCG. As shown in Figure 5, the model with P2 ranks almost all the suitable answers, such as "yes, he is", "yes he is" and "I think so", at the top places together with the ground-truth answer "yes", which significantly improves the NDCG performance.

7. Conclusions

In this paper, we proposed two causal principles for improving the VisDial task. They are model-agnostic, and thus can be applied to almost all existing methods and bring major improvements. The principles are drawn from our in-depth causal analysis of the nature of VisDial, which has unfortunately been overlooked by our community. As technical contributions, we offered several implementations of the principles on top of baseline models. We conducted extensive experiments on the official VisDial dataset and the online evaluation servers, and the promising results demonstrate the effectiveness of the two principles. Moving forward, we will stick to our causal thinking to discover other potential causalities hidden in embodied Q&A and conversational visual dialog tasks.

Acknowledgement. This work was partially supported by the National Natural Science Foundation of China (61573363 and 61832017), the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (15XNLQ01), and NTU-Alibaba JRI. We would like to thank the anonymous reviewers for their constructive comments.


References

[1] Visual Dialog. https://visualdialog.org/.
[2] Visual Dialog Challenge 2019 Leaderboard. https://evalai.cloudcv.org/web/challenges/challenge-page/161/leaderboard/483/.
[3] Visual Dialog Challenge 2019 Winner Report. https://drive.google.com/file/d/1fqg0hregsp_3USM6XCHx89S9JLXt8bKp.
[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
[5] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[6] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In 8th International Conference on Learning Representations, ICLR, 2020.
[7] Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, 2011.
[8] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, 2014.
[9] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.
[10] Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512, 2017.
[11] Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, and Jianfeng Gao. Multi-step reasoning via recurrent dual attention for visual dialog. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pages 6463–6474, 2019.
[12] Dalu Guo, Chang Xu, and Dacheng Tao. Image-question-answer synergistic network for visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10434–10443, 2019.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] Unnat Jain, Svetlana Lazebnik, and Alexander G Schwing. Two can play this game: Visual dialog with discriminative question generation and answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5754–5763, 2018.
[16] Gi-Cheon Kang, Jaeseo Lim, and Byoung-Tak Zhang. Dual attention networks for visual reference resolution in visual dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, pages 2024–2033, 2019.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, 2015.
[18] Satwik Kottur, Jose MF Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 153–169, 2018.
[19] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[21] Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pages 314–324, 2017.
[22] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297, 2016.
[23] Divyat Mahajan, Chenhao Tan, and Amit Sharma. Preserving causal constraints in counterfactual explanations for machine learning classifiers. In Advances in Neural Information Processing Systems Workshop on Causal Machine Learning, 2019.
[24] Suraj Nair, Yuke Zhu, Silvio Savarese, and Li Fei-Fei. Causal induction from visual observations for goal directed tasks. In Advances in Neural Information Processing Systems Workshop on Causal Machine Learning, 2019.
[25] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 299–307, 2017.
[26] Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, and Ji-Rong Wen. Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6679–6688, 2019.
[27] Judea Pearl et al. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
[28] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.
[29] Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
[30] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[32] Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G Schwing. Factor graph attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2039–2048, 2019.
[33] Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. Visual reference resolution using attention memory for visual dialog. In Advances in Neural Information Processing Systems, pages 3719–3729, 2017.
[34] Rahul Singh and Liyang Sun. De-biased machine learning for compliers. In Advances in Neural Information Processing Systems Workshop on Causal Machine Learning, 2019.
[35] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[36] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In Conference on Computer Vision and Pattern Recognition, 2020.
[37] Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6619–6628, 2019.
[38] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4223–4232, 2018.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[40] Tan Wang, Jianqiang Huang, Hanwang Zhang, and Qianru Sun. Visual commonsense R-CNN. In Conference on Computer Vision and Pattern Recognition, 2020.
[41] Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, and Anton van den Hengel. Are you talking to me? Reasoned visual dialog generation through adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6106–6115, 2018.
[42] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[43] Tianhao Yang, Zheng-Jun Zha, and Hanwang Zhang. Making history matter: Gold-critic sequence training for visual dialog. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
[44] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10685–10694, 2019.
[45] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018.
[46] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 1821–1830, 2017.
[47] Zilong Zheng, Wenguan Wang, Siyuan Qi, and Song-Chun Zhu. Reasoning visual dialogs with structural and partial observations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6669–6678, 2019.



Supplementary Material

This supplementary material gives further details for the main paper, including: A. more basic knowledge of causal graphs for a better understanding; B. what the confounder brings to us and why we use do; C. proofs and details omitted in the main paper due to space limitations; D. more qualitative examples that testify to the effectiveness of our principles; E. the complete tables, including the other metrics.

A. Basic Knowledge of Causal Graphs

A.1. Causal Graph

The basic definition of a causal graph is introduced in the main paper; here we give more details. The most basic configuration is X → Y, which denotes that X causes Y, or that Y listens to X. This directed path from X to Y is called a causal path, and it carries X's causal effect on Y. In the real world, what we want to know is the causal effect among variables, not merely their co-occurrence.

To ease the understanding of the theory introduced later, we start from the simplest causal graph configurations. There are three basic ones. 1) Chain: one arrow directed into and one arrow directed out of the middle variable, as shown in Figure 1(a). 2) Fork: two arrows emanating from the middle variable, as shown in Figure 1(b). 3) Collider: the middle variable receives arrows from two other nodes, as in X → Z ← Y; it is not drawn in the figure because we do not use it.

A.2. Conditional Independence

In this section, we discuss the dependencies between variables in a causal graph. Take the Chain in Figure 1(a) as an example. It is obvious that:

X and Z are dependent, i.e., for some x, z, P(Z = z | X = x) ≠ P(Z = z);
Z and Y are dependent, i.e., for some z, y, P(Y = y | Z = z) ≠ P(Y = y).

These two statements hold because, by the definition of a causal graph, a child node (Z or Y) listens to its parent node (X or Z). Furthermore:

X and Y are likely dependent, i.e., for some x, y, P(Y = y | X = x) ≠ P(Y = y);
X and Y are independent conditional on Z, i.e., for all x, y, z, P(Y = y | Z = z, X = x) = P(Y = y | Z = z).

For the last statement, we only compare cases where the value of Z is held constant. Since Z does not change, changes in X can no longer be passed on to Y through Z, so any additional variation in X and Y must be independent of each other.

Figure 1. Examples of some causal graph configurations and our graph of visual dialog: (a) Chain; (b) Fork; (c) Our Graph, with nodes H, Q, I, V, A and the unobserved confounder U.

For example, let X, Z, and Y represent the three events “there is fire”, “there is smoke”, and “the smoke detector is on”, respectively. If Z is always equal to 1 (i.e., “there is smoke” is always true), then X no longer influences Y: whether or not there is a fire, the smoke detector stays on. Therefore, X and Y are independent conditional on Z. In the Fork shown in Figure 1(b), the same conditional independence between X and Y given Z also holds.
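To make the conditional-independence statement concrete, here is a minimal NumPy sketch (ours, not from the paper; the event probabilities are made-up toy values) that simulates the Chain X → Z → Y from the fire/smoke example and checks that Y depends on X marginally but not once Z is fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy Chain X -> Z -> Y ("fire" -> "smoke" -> "detector on"); all probabilities are made up.
x = rng.random(n) < 0.3                                      # P(X = 1) = 0.3
z = np.where(x, rng.random(n) < 0.9, rng.random(n) < 0.1)    # P(Z = 1 | X)
y = np.where(z, rng.random(n) < 0.8, rng.random(n) < 0.05)   # P(Y = 1 | Z): Y listens to Z only

p = lambda mask: y[mask].mean()
print(p(x), p(~x))          # P(Y=1|X=1) vs. P(Y=1|X=0): clearly different -> dependent
print(p(x & z), p(~x & z))  # P(Y=1|X=1,Z=1) vs. P(Y=1|X=0,Z=1): nearly equal -> independent given Z
```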

B. Causal Effect, Confounder, and do

In this section, we give a systematic analysis of the influence of the confounder, why we need do, and how to calculate it. First, we explain what a causal effect is. In the following, a capital letter denotes a variable and a lowercase letter denotes a value.

B.1. Causal Effect

In the simple causal graph X → Y, the effect of X on Y is P(Y | X) − P(Y). Since the prior P(Y) is a constant, for convenience we sometimes use P(Y | X) alone to represent the effect of X on Y in this paper. Note that there is only one path from X to Y, so the effect of X on Y can only pass through the causal path; hence P(Y | X) is the causal effect. In the real world, however, things are rarely this simple.

B.2. Confounder

The definition of a confounder is given in Section 4 of the main paper. Take the Fork as an example: by that definition, Z in Figure 1(b) is a confounder for X and Y. In this graph, X has no causal effect on Y, because if we change only X and keep Z fixed, Y will not change (i.e., the causal effect of X on Y is 0). Yet when we try to calculate this causal effect, we find that we cannot use P(Y | X): as mentioned in Section 4.2 of the main paper, P(Y | X) − P(Y) is not always zero. In short, the confounder prevents us from using P(Y | X) to represent the causal effect, and we need a new notation for it.

B.3. do



The book [6] introduces a new notation, P(Y | do(X = x)), which represents the causal effect of X on Y. In this section, we explain why it represents the causal effect and how to calculate it. For concision, we write do(X) for do(X = x) in the following.

do-operator. As mentioned in the main paper, do is a type of intervention: we assign a value to a variable instead of letting its parent nodes cause it. For example, in Figure 1(b), do(X) means that we set X to the value x while ignoring its causal mechanism (i.e., the arrow X ← Z). Therefore, when we apply do to a variable, we cut off all the arrows coming into that variable, because its parents no longer cause it. When we calculate P(Y | do(X)), no confounder can simultaneously cause X and Y, because all incoming arrows of X are cut off, which guarantees that the result is a causal effect. We give an example in Section B.4 to verify this statement.

Now that we have a notation for the causal effect, we still cannot compute it with existing methods: we need a tool to derive an ordinary probability formula from a do formula. That tool is do-calculus.

do-calculus. Three rules of do-calculus are given in [6] to help us derive probability formulas.

Rule 1. If we observe a variable X that is irrelevant to Y (possibly conditional on other variables Z, as in the Chain of Figure 1(a)), then the probability distribution of Y does not change:

P(Y | z, X) = P(Y | z).   (1)

Rule 2. If a set Z of variables blocks all back-door paths from X to Y, then conditional on Z, as in the Fork of Figure 1(b), do(X) is equivalent to see(X):

P(Y | do(X), z) = P(Y | X, z).   (2)

Rule 3. We can remove do(X) from P(Y | do(X)) whenever there is no causal path from X to Y:

P(Y | do(X)) = P(Y).   (3)

B.4. Revisiting the Fork

Now we have the do-operator to represent the causal effect and do-calculus to compute it. Let us revisit the problem brought by the confounder in Section B.2. In Figure 1(b), P(Y | do(X)) can be expanded as:

P(Y | do(X))
  = Σ_z P(Y | do(X), z) P(z | do(X))
  = Σ_z P(Y | z) P(z | do(X))
  = Σ_z P(Y | z) P(z)
  = P(Y).   (4)

The first line uses the law of total probability (Bayes rule); the second and third lines use Rule 3, since X has no causal path to Y or to Z in the Fork. As a result, P(Y | do(X)) − P(Y) equals 0, which agrees with our reasoning about the causal effect of X on Y in Section B.2. It also confirms that we can use P(Y | do(X)) to compute the causal effect.

In conclusion, the confounder prevents us from using P(Y | X) to represent the causal effect, so we adopt the new notation P(Y | do(X)) to denote it. To evaluate a do formula, we use do-calculus to turn it into an ordinary probability formula, which can then be estimated from observational data. That is the whole story of the confounder and do.
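As a sanity check of Eq. (4), the following sketch (ours, with made-up numbers) builds a toy Fork X ← Z → Y and shows that the observational P(Y | X = 1) differs from P(Y), while the adjusted quantity Σ_z P(Y | X = 1, z) P(z) — which here equals Σ_z P(Y | z) P(z) — recovers P(Y), i.e., a zero causal effect.

```python
import numpy as np

# Toy Fork X <- Z -> Y with binary variables; all numbers are illustrative.
p_z = np.array([0.4, 0.6])                     # P(Z = z)
p_x_given_z = np.array([[0.8, 0.2],            # P(X = x | Z = 0)
                        [0.3, 0.7]])           # P(X = x | Z = 1)   (rows: z, columns: x)
p_y_given_z = np.array([[0.9, 0.1],            # P(Y = y | Z = 0)
                        [0.2, 0.8]])           # P(Y = y | Z = 1)   (rows: z, columns: y)

# Joint P(Z, X, Y) of the Fork: Y listens to Z only, so P(Y | X, Z) = P(Y | Z).
joint = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

p_y = joint.sum(axis=(0, 1))                                        # P(Y)
p_y_given_x1 = joint[:, 1, :].sum(axis=0) / joint[:, 1, :].sum()    # P(Y | X = 1): confounded by Z
p_y_do_x1 = (p_y_given_z * p_z[:, None]).sum(axis=0)                # sum_z P(Y | z) P(z), as in Eq. (4)

print("P(Y)           =", p_y)
print("P(Y | X=1)     =", p_y_given_x1)   # differs from P(Y): spurious association through Z
print("P(Y | do(X=1)) =", p_y_do_x1)      # equals P(Y): the causal effect of X on Y is zero
```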

C. Proofs and Details

C.1. Proofs of Equation 1

For convenience, we redraw our graph in Figure 1(c) and restate the equation, adding one intermediate step to the derivation:

P(A | do(Q, H, I))
  = Σ_u P(A | do(Q, H, I), u) P(u | do(Q, H, I))
  = Σ_u P(A | do(Q, H, I), u) P(u | do(H))
  = Σ_u P(A | do(Q), H, I, u) P(u | H)
  = Σ_u P(A | Q, H, I, u) P(u | H).   (5)

According to the rules of do-calculus introduced in Section B.3, the derivation is justified as follows. The first step follows from the law of total probability (Bayes rule). The second step holds because Q and I have no causal path to U, by Rule 3. The third step holds because H and I have no back-door path to A, by Rule 2. For the last step, although Q has two back-door paths to A (i.e., Q ← H → U → A and Q ← U → A), both are blocked once we condition on U, so by Rule 2 the final transformation is valid.
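For readers who prefer code, here is a minimal sketch (ours, not the released implementation) of how Eq. (5) can be evaluated once U is approximated by a small set of samples: the intervened answer distribution is a P(u | H)-weighted mixture of the per-u answer distributions. The scoring function and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def intervened_answer_dist(score_answers, q, h, i, u_embeds, p_u_given_h):
    """P(A | do(Q, H, I)) = sum_u P(A | Q, H, I, u) P(u | H), as in Eq. (5).

    score_answers(q, h, i, u) -> logits over the answer candidates (hypothetical model).
    u_embeds:     (num_u, d) tensor, one embedding per confounder sample u.
    p_u_given_h:  (num_u,) tensor with the prior P(u | H) of each sample.
    """
    mixture = 0.0
    for u, p_u in zip(u_embeds, p_u_given_h):
        p_a_given_u = F.softmax(score_answers(q, h, i, u), dim=-1)   # P(A | Q, H, I, u)
        mixture = mixture + p_u * p_a_given_u                        # weight by P(u | H)
    return mixture
```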

C.2. Details of Loss Functions

Following Equation 4 in Section 5.2 of the main paper, we give three loss functions.

Weighted Softmax Loss (R1).

R1 = Σ_i log(softmax(p_i)) · s_i,   (6)

where p_i is the logit of candidate a_i and s_i is the corresponding normalized relevance score.

Binary Sigmoid Loss (R2).

R2 = Σ_i [log(σ(p_i)) · s_i + log(σ(1 − p_i)) · (1 − s_i)],   (7)

where σ is the sigmoid function, p_i is the logit of candidate a_i, and s_i is the corresponding normalized relevance score.

Generalized Ranking Loss (R3).

R3 = Σ_i log [ exp(p_i) / (exp(p_i) + Σ_{j∈G} exp(p_j)) ] · s_i,   (8)



where p_i is the logit of candidate a_i, G is the group of candidates with a lower relevance score than a_i, and s_i is the normalized characteristic score (i.e., 0 for a_i with relevance score 0 and 1 for a_i with a positive relevance score). Note that this function is reorganized from ListNet [2].
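The following PyTorch sketch shows one way to implement R1–R3 as minimization objectives (ours, not the official code): each R above is negated so that gradient descent maximizes it, R2 is read as the standard per-candidate binary cross-entropy against the soft score s_i (i.e., the negative term uses log(1 − σ(p_i))), and the tensor names are hypothetical. `logits` holds the candidate logits p_i of one round and `scores` the corresponding normalized relevance scores s_i.

```python
import torch
import torch.nn.functional as F

def weighted_softmax_loss(logits, scores):
    """R1: sum_i log(softmax(p)_i) * s_i, negated so that minimizing the loss maximizes R1."""
    return -(F.log_softmax(logits, dim=-1) * scores).sum(dim=-1).mean()

def binary_sigmoid_loss(logits, scores):
    """R2, read as a per-candidate binary cross-entropy against the soft score s_i."""
    return F.binary_cross_entropy_with_logits(logits, scores)

def generalized_ranking_loss(logits, scores):
    """R3: for each candidate i, a softmax over {i} and the group G of candidates with a
    strictly lower relevance score, weighted by the 0/1 characteristic score s_i."""
    char = (scores > 0).float()
    terms = []
    for i in range(logits.size(-1)):
        lower = (scores < scores[..., i:i + 1]).float()              # the group G for candidate i
        denom = logits[..., i].exp() + (logits.exp() * lower).sum(dim=-1)
        terms.append(char[..., i] * (logits[..., i] - denom.log()))  # log softmax over {i} and G
    return -torch.stack(terms, dim=-1).sum(dim=-1).mean()
```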

C.3. Proofs of Formula 6

Following [8], we can use NWGM[softmax(f_s(e_c, u, m))] (i.e., the normalized weighted geometric mean) to approximate E_{u|H}[softmax(f_s(e_c, u, m))]. Let P(u_i | H) be the probability of sample u_i drawn from U, and note that softmax(f_s(e_c, u, m)) ∝ exp(f_s(e_c, u, m)). Writing n_{c,i} for f_s(e_c, u_i, m), the expectation E_{u|H}[softmax(f_s(e_c, u, m))] can be written as:

E_{u|H}[softmax(f_s(e_c, u, m))]
  ≈ NWGM[softmax(f_s(e_c, u, m))]
  = Π_i exp(n_{c,i})^{P(u_i|H)} / Σ_j Π_i exp(n_{j,i})^{P(u_i|H)}
  = exp(E_{u|H}[n_{c,i}]) / Σ_j exp(E_{u|H}[n_{j,i}]),   (9)

where j is the index over all the candidates. If f_s(·) is a linear layer, the equation can be further written as:

E_{u|H}[softmax(f_s(e_c, u, m))] ≈ softmax(E_{u|H}[f_s(e_c, u, m)]).   (10)
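A minimal module sketch of how we read Eq. (10) in practice (ours; the interfaces of e_c and m, the dot-product attention, and the candidate handling are simplifying guesses): the expectation over U is taken once, as an attention-weighted average of the dictionary entries D_u given the history feature, and is then fed through a linear score function inside a single softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictionaryHead(nn.Module):
    """Sketch of Eq. (10): with a linear f_s, E_{u|H}[softmax(f_s(e_c, u, m))] is approximated
    by softmax(f_s(e_c, E[u|H], m)). The 100 x 512 dictionary size follows Section C.4."""

    def __init__(self, dict_size=100, dim=512):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(dict_size, dim))   # D_u
        self.f_s = nn.Linear(3 * dim, 1)                              # linear score function

    def forward(self, e_cands, h, m):
        # e_cands: (batch, num_cands, dim) candidate features; h, m: (batch, dim).
        p_u = F.softmax(h @ self.dictionary.t(), dim=-1)              # P(u_i | H) via attention
        u_expect = p_u @ self.dictionary                              # E[u | H], (batch, dim)
        ctx = torch.cat([u_expect, m], dim=-1).unsqueeze(1)           # (batch, 1, 2*dim)
        ctx = ctx.expand(-1, e_cands.size(1), -1)
        logits = self.f_s(torch.cat([e_cands, ctx], dim=-1)).squeeze(-1)   # one score per candidate
        return F.softmax(logits, dim=-1)                              # ≈ E_{u|H}[softmax(f_s(...))]
```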

C.4. Details of the Principle Implementation

Details of the Enhanced LF [3]. After obtaining the vision and language features H, Q, I, we perform the following operations (Att denotes the attention operation introduced in the main paper): 1) History feature refinement: h = Att(H, q_t), guided by the last term of Q (i.e., q_t). 2) Question and caption feature refinement: q = Att(Q, c_t), c = Att(C, q_t). 3) Vision feature refinement: v = Att(V, {q, c}). 4) Second-step vision feature refinement: v′ = Att(v, g_v([h; q])), where g_v is a fully connected layer followed by a softmax that generates weights for refining the visual attention. 5) Feature fusion: e = g_f([v′; q]), where g_f is a multi-head fully connected layer. More details are listed in Table 1.

Index  Input                 Operation       Output
(1)    H (word) (rnd × 40)   embed and LSTM  H (rnd × 512)
(2)    C (word) (1 × 20)     embed and LSTM  C (20 × 512)
(3)    Q (word) (1 × 20)     embed and LSTM  Q (20 × 512)
(4)    (H, q_t)              Attention       h (1 × 512)
(5)    (C, q_t)              Attention       c (1 × 512)
(6)    (Q, c_t)              Attention       q (1 × 512)
(7)    (I, q, c)             Attention       v (2 × 2048)
(8)    (v, [h; q])           Attention       v′ (1 × 2048)
(9)    (v′, q)               Concatenate     e (1 × 2560)

Table 1. Details of the Enhanced LF encoder, where rnd is the current number of history rounds, C is the image caption, I is the official image feature with dimension 36 × 2048, and e is the output of the encoder.

Details of P2. For the question-type variant of P2, we manually defined 55 question types and counted the occurrences of ground-truth answers under each type. Answer candidates occurring more than 5 times were taken as preferred answers and annotated with a score of 1 under the corresponding question type. At training time, we pre-trained the model with the original method for 5 epochs and assigned every answer candidate, in each round of every dialog, the normalized relevance score counted from its question type; we then used this normalized QT-relevance score to further train the model with R2. For Answer Score Sampling, we pre-trained the model for 5 epochs and then further trained it on the dense annotations with our loss functions. As for the dictionary, we set up a 100 × 512 dictionary D_u to explore the latent representation of U. We pre-trained the dictionary with the one-hot ground-truth answers and then trained it with the R3 loss on the dense annotations. We then fused the prediction of the dictionary with the prediction of the pre-trained model as logit_i + w · d_i, where logit_i is the prediction of the original model, d_i is the prediction of the dictionary, and w is a manually set weight, which we set to 0.1. Finally, we further trained the whole model with R3.
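Below is a small sketch (ours; the data layout and the `question_type_of` classifier are hypothetical stand-ins) of the question-type prior described above: count ground-truth answers per question type on the training split and mark the frequent ones (occurrence > 5) as preferred, i.e., relevance 1 under that question type.

```python
from collections import Counter, defaultdict

def build_qt_prior(train_dialogs, question_type_of, min_count=5):
    """Count ground-truth answers per question type and keep the frequent ones as 'preferred'."""
    counts = defaultdict(Counter)
    for dialog in train_dialogs:                         # each dialog: {"turns": [...]}
        for turn in dialog["turns"]:                     # each turn: {"question", "gt_answer"}
            counts[question_type_of(turn["question"])][turn["gt_answer"]] += 1
    return {
        qt: {ans for ans, c in ctr.items() if c > min_count}   # preferred answers per type
        for qt, ctr in counts.items()
    }
```

The dictionary-fusion step mentioned above is then just a weighted sum of the two predictions, e.g., fused_i = logit_i + 0.1 · d_i.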

C.5. Further Discussion of Metrics

Besides NDCG, there are other metrics such as MRR (Mean Reciprocal Rank) and R@k (Recall@k). We omit these metrics in the main paper because we believe NDCG evaluates the VisDial task better. The reasons are two-fold. 1) The ground truth for MRR and the other metrics is the single answer given by a user, whose answering preference (e.g., length) is consistent across the whole 10-round dialog. Thus, if the user prefers longer answers such as “Yes, I can see a dog”, a short “Yes” will be unreasonably penalized. We argue that this may be one reason why traditional models with the history shortcut achieve higher MRR, due to the bias illustrated in Figure 2(a) of the main paper. Consequently, if a model has a high MRR, e.g., it scores “Yes, I can see a dog” high, it must push “Yes” low, leading to a lower NDCG; that is, metrics such as MRR conflict with NDCG. 2) As mentioned in Section 1, answering in VisDial is interactive, whereas VQA has only one chance; a soft answer score (NDCG) therefore encourages interaction better than one-hot accuracy (MRR) for the VisDial task. Finally, note that NDCG is recommended by the VisDial organizer, A. Das, who announced at the VisDial Workshops in 2018 and 2019 that it is the only metric used to select winners. As a result, we use NDCG as the only metric in the main paper, but for completeness we still report the other metrics in Section E.
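For completeness, here is a simplified NDCG computation for one round (ours; the official evaluation script may differ in details such as tie handling): candidates are ranked by the model's scores, and the discounted cumulative gain of the dense relevance annotations over the top K positions (K = number of candidates with non-zero relevance) is divided by that of the ideal ordering.

```python
import numpy as np

def ndcg(pred_scores, relevance):
    """Simplified per-round NDCG over the answer candidates."""
    pred_scores = np.asarray(pred_scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    k = int((relevance > 0).sum())                    # evaluate the top-K positions
    order = np.argsort(-pred_scores)                  # predicted ranking (best first)
    ideal = np.sort(relevance)[::-1]                  # best achievable ranking
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (relevance[order][:k] * discounts).sum()
    idcg = (ideal[:k] * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0
```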



Figure 2. Word Matching. Qualitative examples comparing the ranked answers of the Baseline and Baseline + P1: (a) matching word “room”, NDCG 0.62 → 0.93; (b) matching word “eyes”, NDCG 0.49 → 0.90; (c) matching word “yes”, NDCG 0.53 → 0.70; (d) matching word “yes”, NDCG 0.52 → 0.85.

D. More Qualitative Examples

In this section, we give more examples of the advantages of our principles mentioned in Section 6 of the main paper, including two types of history-bias elimination by P1, shown in Figure 2(a) to Figure 2(d), and better ranking by P2, shown in Figure 3(a) and Figure 3(b).

E. The Whole Tables

In this section, we give the full tables of our experiments, in particular the results on the metrics omitted in the main paper: 1) the mean rank of the one-hot ground-truth answer (i.e., the human response) (Mean); 2) Recall@k (R@k), i.e., whether the human response appears in the top-k ranked candidates; and 3) the mean reciprocal rank (MRR) of the human response in the returned ranked list. Note that, as discussed above, these metrics are not well suited to visual dialog.

P2            NDCG(%)  MRR(%)  R@1(%)  R@5(%)  R@10(%)  Mean
LF
  baseline    57.12    64.33   50.46   81.41   90.15    4.03
  QT          58.97    64.42   50.70   81.40   89.93    4.13
  S(R0)       67.82    51.82   40.66   63.31   75.86    8.21
  S(R1)       71.27    51.40   38.30   65.54   78.78    7.09
  S(R2)       72.04    50.84   38.65   63.54   77.76    7.26
  S(R3)       72.36    50.38   37.13   64.22   78.09    7.13
  D           72.65    50.18   37.11   64.50   78.59    7.08
LF+P1
  baseline    61.88    61.46   47.46   78.63   88.12    4.58
  QT          62.87    62.09   48.13   79.40   88.79    4.47
  S(R0)       69.47    50.54   39.71   61.41   74.55    8.72
  S(R1)       72.16    51.20   38.56   64.78   77.96    7.46
  S(R2)       72.85    50.93   38.88   63.41   77.66    7.35
  S(R3)       73.42    50.53   38.41   63.12   77.54    7.40
  D           73.63    50.56   37.99   63.98   77.95    7.26

Table 2. The whole table of comparisons for the experiments applying our principles on the VisDial v1.0 validation set. LF is the enhanced version mentioned above. QT, S, and D denote question type, answer score sampling, and hidden dictionary learning, respectively. R0, R1, R2, and R3 denote the regressive loss, weighted softmax loss, binary sigmoid loss, and generalized ranking loss, respectively.

Model       P         NDCG(%)  MRR(%)  R@1(%)  R@5(%)  R@10(%)  Mean
LF [3]      baseline  57.12    64.33   50.46   81.41   90.15    4.03
            +P1       61.88    61.46   47.46   78.63   88.12    4.58
            +P2       72.65    50.18   37.11   64.50   78.59    7.08
            +P1+P2    73.63    50.56   37.99   63.98   77.95    7.26
HCIAE [4]   baseline  56.98    64.13   50.31   81.42   90.18    4.09
            +P1       60.12    61.00   46.66   78.74   88.34    4.61
            +P2       71.50    46.96   32.43   63.47   78.43    7.28
            +P1+P2    71.99    46.83   33.20   61.64   76.53    7.67
CoAtt [7]   baseline  56.46    63.81   49.77   81.20   90.19    4.13
            +P1       60.27    60.97   46.83   78.29   87.86    4.66
            +P2       71.41    47.32   33.35   63.51   77.26    7.56
            +P1+P2    71.87    46.41   32.79   61.27   76.37    7.87
RvA [5]     baseline  56.74    64.49   50.67   81.64   90.50    3.98
            +P1       61.02    62.00   47.99   79.14   89.04    4.42
            +P2       71.44    50.33   36.85   64.94   78.81    7.05
            +P1+P2    72.88    49.34   36.62   62.96   77.75    7.44

Table 3. The whole table of ablative studies on different models on the VisDial v1.0 validation set. P2 indicates the most effective variant (i.e., hidden dictionary learning) shown in Table 2. Note that applying only P2 is implemented by the attempts in Section 5 of the main paper, with the history shortcut.


References

[1] Visual Dialog Challenge 2019 Leaderboard. https://evalai.cloudcv.org/web/challenges/challenge-page/161/leaderboard/483/.

[2] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM, 2007.

[3] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.

[4] Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pages 314–324, 2017.



Figure 3. Better Ranking. Qualitative comparisons of the relevance scores of the top-ranked answers from the Baseline and Baseline + P2: (a) better ranking for the semantics “no”, NDCG 0.28 → 0.88 (key rank increases: “nope” (0.8) from 20 to 2, “not that i can see” (1.0) from 13 to 3, “i do not see any” (0.8) from 21 to 4); (b) better ranking for the semantics “yes”, NDCG 0.57 → 0.94 (key rank increases: “yes it is” (0.8) from 5 to 2, “yes, it is” (0.8) from 8 to 3, “yep” (0.8) from 9 to 4).

Model                               NDCG(%)  MRR(%)  R@1(%)  R@5(%)  R@10(%)  Mean
Ours
  P1+P2 (More Ensemble)             74.91    49.13   36.68   62.96   78.55    7.03
  LF+P1+P2 (Ensemble)               74.19    46.69   32.45   62.13   77.10    7.33
  LF+P1+P2 (single)                 71.60    48.58   35.98   62.08   77.23    7.48
  RvA+P1+P2 (single)                71.28    47.71   34.80   61.53   77.10    7.63
  CoAtt+P1+P2 (single)              69.81    44.83   30.83   60.65   75.73    8.08
  HCIAE+P1+P2 (single)              69.66    44.03   29.85   59.50   75.98    8.10
Leaderboard
  VD-BERT (Ensemble)*               75.13    50.00   38.28   60.93   77.28    6.90
  Tohuku-CV Lab (Ensemble)*         74.88    52.14   38.93   66.60   80.65    6.53
  MReaL-BDAI*                       74.02    52.62   40.03   65.85   79.15    6.76
  SFCU (Single)*                    72.80    45.11   32.48   57.78   74.73    7.86
  FancyTalk (HeteroFM)*             72.33    54.56   42.58   67.27   80.05    6.37
  Tohuku-CV Lab (Ensemble w/o ft)*  66.53    63.19   49.18   80.45   89.75    4.14

Table 4. Our results and comparisons to the recent 2nd Visual Dialog Challenge (2019) leaderboard results on the test-std set of VisDial v1.0. Results are reported by the test server; (*) is taken from [1]. Note that the top five models on the leaderboard use the dense fine-tuning implementation illustrated in Section 5.1 of the main paper.

[5] Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, and Ji-Rong Wen. Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6679–6688, 2019.

[6] Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.

[7] Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, and Anton van den Hengel. Are you talking to me? Reasoned visual dialog generation through adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6106–6115, 2018.

[8] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
