Multimodal Compact Bilinear Pooling for VQA
![Page 1: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/1.jpg)
Multimodal Compact Bilinear Pooling for VQA
Akira Fukui¹,², Dong Huk Park¹, Daylen Yang¹, Anna Rohrbach¹,³, Trevor Darrell¹, Marcus Rohrbach¹
¹UC Berkeley EECS, CA, United States; ²Sony Corp., Tokyo, Japan; ³Max Planck Institute for Informatics, Saarbrücken, Germany
![Page 2: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/2.jpg)
Multimodal language and visual understanding
Description
A table full of food for a feast
![Page 3: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/3.jpg)
Multimodal language and visual understanding
Grounding
The bowl with the brown sauce
![Page 4: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/4.jpg)
Multimodal language and visual understanding
Visual Question Answering
What is the brown sauce?
Gravy
![Page 5: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/5.jpg)
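How to Combine Image Representation and Question Representation?
[Diagram: image → CNN → image representation (spoon, plate, bowl, table, food, corn, …, person); question "Is this going to be a feast?" → LSTM → question representation (Is, ?, feast, going, to, be, …); answer: Yes]
☐ All elements can interact
☐ Multiplicative interaction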
![Page 6: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/6.jpg)
How to Combine Image Representation and Question Representation?
Concatenate
[Diagram: CNN image representation and LSTM question representation for "Is this going to be a feast?" are concatenated and passed through FC and ReLU layers; answer: Yes]
☑ All elements can interact
☐ Multiplicative interaction
- Difficult to learn output classification
![Page 7: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/7.jpg)
How to Combine Image Representation and Question Representation?
Elementwise Multiplication
[Diagram: CNN image representation and LSTM question representation for "Is this going to be a feast?" are multiplied elementwise (⊙) and passed through FC and ReLU layers; answer: Yes]
☐ All elements can interact
☑ Multiplicative interaction
- Difficult to learn input embedding
![Page 8: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/8.jpg)
How to Combine Image Representation and Question Representation?
Outer Product / Bilinear Pooling [Lin ICCV 2015]
[Diagram: the outer product of the CNN image representation and the LSTM question representation for "Is this going to be a feast?" feeds an FC layer; answer: Yes]
☑ All elements can interact
☑ Multiplicative interaction
[Lin ICCV 2015] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. ICCV 2015
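The three fusion options on these slides can be compared on toy vectors. The following is a minimal NumPy sketch, not the authors' code; the 4-dimensional vectors and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)  # stand-in for the image representation
y = rng.standard_normal(4)  # stand-in for the question representation

concat = np.concatenate([x, y])  # 8 values, no products between x and y
eltwise = x * y                  # 4 values, only the products x[i] * y[i]
outer = np.outer(x, y)           # 4 x 4 values, every product x[i] * y[j]

print(concat.shape, eltwise.shape, outer.shape)
```

The elementwise product is just the diagonal of the outer product, which is why only the outer product satisfies both "all elements can interact" and "multiplicative interaction".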
![Page 9: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/9.jpg)
How to Combine Image Representation and Question Representation?
Outer Product / Bilinear Pooling [Lin ICCV 2015]
[Diagram: the outer product of the 2048-dimensional CNN image representation and the 2048-dimensional LSTM question representation for "Is this going to be a feast?" has about 4 million entries; the FC layer then needs 4 million x 1000 parameters; answer: Yes]
☑ All elements can interact
☑ Multiplicative interaction
☐ High # activations & computation
☐ High # parameters
![Page 10: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/10.jpg)
Multimodal Compact Bilinear Pooling
Compact Bilinear Pooling (MCB) [Gao CVPR 16]
[Diagram: the 2048-dimensional CNN image representation and the 2048-dimensional LSTM question representation for "Is this going to be a feast?" are combined by MCB; the FC layer needs 16k x 1000 parameters; answer: Yes]
☑ All elements can interact
☑ Multiplicative interaction
☑ Low # activations & computation
☑ Low # parameters
[Gao CVPR 16] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. CVPR 2016
[ICLR Workshops 2016] N. Zhang, E. Shelhamer, Y. Gao, T. Darrell. Fine-grained pose prediction, normalization, and recognition
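The sizes quoted on the bilinear pooling slides can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes a bias-free FC layer and reads "16k" as 16,000, and the helper name `fc_weights` is made up for illustration.

```python
def fc_weights(input_dim: int, output_dim: int) -> int:
    """Number of weights in a fully connected layer without bias (assumption)."""
    return input_dim * output_dim

image_dim = 2048
question_dim = 2048
num_outputs = 1000  # the "x 1000" factor on the slides

outer_dim = image_dim * question_dim                 # full outer product size
full_bilinear = fc_weights(outer_dim, num_outputs)   # FC on the outer product
compact = fc_weights(16_000, num_outputs)            # FC on a 16k MCB output

print(f"outer product entries: {outer_dim:,}")
print(f"full bilinear FC weights: {full_bilinear:,}")
print(f"compact bilinear FC weights: {compact:,}")
```

Under these assumptions the classifier on top of the compact representation is a few hundred times smaller than the one on top of the full outer product.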
![Page 11: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/11.jpg)
Multimodal Compact Bilinear Pooling
Random Projection: Count sketch Ψ
Pham & Pagh (2013): $\Psi(x \otimes y) = \Psi(x) * \Psi(y)$
[Diagram: CNN and LSTM representations for "Is this going to be a feast?" pass through a count sketch Ψ; the FC layer needs 16k x 1000 parameters; answer: Yes]
☑ All elements can interact
☑ Multiplicative interaction
☐ Low # activations & computation
☑ Low # parameters
[Count sketch] M. Charikar, K. Chen, M. Farach-Colton. Finding frequent items in data streams. Automata, Languages and Programming 2002.
[Pham & Pagh 13] N. Pham, R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. KDD 2013
![Page 12: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/12.jpg)
Multimodal Compact Bilinear Pooling
Pham & Pagh (2013): $\Psi(x \otimes y) = \Psi(x) * \Psi(y)$
[Diagram: the CNN and LSTM representations for "Is this going to be a feast?" each pass through a count sketch Ψ; the two sketches are combined by convolution (∗); the FC layer needs 16k x 1000 parameters; answer: Yes]
☑ All elements can interact
☑ Multiplicative interaction
☑ Low # activations & computation
☑ Low # parameters
![Page 13: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/13.jpg)
Multimodal Compact Bilinear Pooling
Pham & Pagh (2013): $\Psi(x \otimes y) = \Psi(x) * \Psi(y)$
[Diagram: the CNN and LSTM representations for "Is this going to be a feast?" each pass through a count sketch Ψ; the convolution is computed as an FFT of each sketch, an elementwise product (⊙), and an inverse FFT (FFT⁻¹); the FC layer needs 16k x 1000 parameters; answer: Yes]
☑ All elements can interact
☑ Multiplicative interaction
☑ Low # activations & computation
☑ Low # parameters
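The count sketch and FFT steps can be written down compactly. The following NumPy sketch is an illustration under stated assumptions, not the released vqa-mcb code: the function names, the small toy sizes, and the random seed are all hypothetical. It also checks the Pham & Pagh identity numerically against a direct count sketch of the outer product.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    """Sample a bucket h[i] in [0, d) and a sign s[i] in {-1, +1} per input index."""
    h = rng.integers(0, d, size=n)
    s = rng.choice(np.array([-1.0, 1.0]), size=n)
    return h, s

def count_sketch(v, h, s, d):
    """Psi(v): add s[i] * v[i] into bucket h[i] of a d-dimensional vector."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)  # unbuffered add, so colliding buckets accumulate
    return out

def mcb(x, y, hx, sx, hy, sy, d):
    """Psi(x) * Psi(y), with the circular convolution done in the FFT domain."""
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)

def sketch_of_outer_product(x, y, hx, sx, hy, sy, d):
    """Direct count sketch of the outer product, for checking only (O(n^2))."""
    out = np.zeros(d)
    for i in range(len(x)):
        for j in range(len(y)):
            out[(hx[i] + hy[j]) % d] += sx[i] * sy[j] * x[i] * y[j]
    return out

rng = np.random.default_rng(0)
n, d = 8, 16  # toy sizes; the slides use 2048 inputs and a 16k output
x, y = rng.standard_normal(n), rng.standard_normal(n)
hx, sx = make_sketch_params(n, d, rng)
hy, sy = make_sketch_params(n, d, rng)

fast = mcb(x, y, hx, sx, hy, sy, d)
slow = sketch_of_outer_product(x, y, hx, sx, hy, sy, d)
print(np.allclose(fast, slow))
```

The fast path never forms the n x n outer product; it only touches two d-dimensional sketches, which is where the savings in activations and computation come from.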
![Page 14: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/14.jpg)
Related work
• Alternative approach to multiplicative interactions
  - DPPNet: Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016
[Excerpt from the DPPnet paper shown on the slide]

Figure 2. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classification network and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from the candidate weights obtained from the parameter prediction network.

… by constructing multiple branches connected to a common CNN architecture. In this work, we hope to solve the heterogeneous recognition tasks using a single CNN by adapting the weights in the dynamic parameter layer. Since the task is defined by the question in ImageQA, the weights in the layer are determined depending on the question sentence. In addition, a hashing trick is employed to predict a large number of weights in the dynamic parameter layer and avoid parameter explosion.

3.2. Problem Formulation

ImageQA systems predict the best answer $\hat{a}$ given an image $I$ and a question $q$. Conventional approaches [16, 23] typically construct a joint feature vector based on two inputs $I$ and $q$ and solve a classification problem for ImageQA using the following equation:

$\hat{a} = \operatorname*{argmax}_{a \in \Omega} \; p(a \mid I, q; \theta) \quad (1)$

where $\Omega$ is a set of all possible answers and $\theta$ is a vector for the parameters in the network. On the contrary, we use the question to predict weights in the classifier and solve the problem. We find the solution by

$\hat{a} = \operatorname*{argmax}_{a \in \Omega} \; p(a \mid I; \theta_s, \theta_d(q)) \quad (2)$

where $\theta_s$ and $\theta_d(q)$ denote static and dynamic parameters, respectively. Note that the values of $\theta_d(q)$ are determined by the question $q$.

4. Network Architecture

Figure 2 illustrates the overall architecture of the proposed algorithm. The network is composed of two sub-networks: classification network and parameter prediction network. The classification network is a CNN. One of the fully-connected layers in the CNN is the dynamic parameter layer, and the weights in the layer are determined adaptively by the parameter prediction network. The parameter prediction network has GRU cells and a fully-connected layer. It takes a question as its input, and generates a real-valued vector, which corresponds to candidate weights for the dynamic parameter layer in the classification network. Given an image and a question, our algorithm estimates the weights in the dynamic parameter layer through hashing with the candidate weights obtained from the parameter prediction network. Then, it feeds the input image to the classification network to obtain the final answer. More details of the proposed network are discussed in the following subsections.

4.1. Classification Network

The classification network is constructed based on VGG 16-layer net [24], which is pre-trained on ImageNet [6]. We remove the last layer in the network and attach three fully-connected layers. The second last fully-connected layer is the dynamic parameter layer whose weights are determined by the parameter prediction network, and the last fully-connected layer is the classification layer whose output dimensionality is equal to the number of possible answers. The probability for each answer is computed by applying a softmax function to the output vector of the final layer.

We put the dynamic parameter layer in the second last fully-connected layer instead of the classification layer because it involves the smallest number of parameters. As the number of parameters in the classification layer increases in proportion to the number of possible answers, predicting the weights for the classification layer may not be a good option to general ImageQA problems in terms of scalability. Our choice for the dynamic parameter layer can be interpreted as follows. By fixing the classification layer while …
![Page 15: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/15.jpg)
Experimental setup (without Attention)
• Solver
  - Cross-entropy loss, Adam, learning rate 0.0007
• Feature Extraction
  - ResNet 152, image: 448x448
• Answers
  - 3000 most frequent on train
  - Sampling with probability of answers
• Trained on train / validated on val / tested on test-dev
[Diagram: image → CNN (ResNet152) → 2048 → L2 Normalization; question "Is this going to be a feast?" → Embed, Tanh (13k ~ 20k → 300) → LSTM, drop (1024) → LSTM, drop (1024) → 2048; both 2048-dimensional inputs → MCB (16k) → Signed Sqrt → L2 Normalization (16k) → Fully Connected (3000) → Softmax → "Yes"]
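The "Signed Sqrt" and "L2 Normalization" boxes applied after MCB are simple elementwise and vector operations. A minimal NumPy sketch follows; the function names and the small `eps` guard against division by zero are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def signed_sqrt(v):
    """Elementwise sign(v) * sqrt(|v|), which compresses large magnitudes."""
    return np.sign(v) * np.sqrt(np.abs(v))

def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean length; eps (an assumption) avoids dividing by zero."""
    return v / (np.linalg.norm(v) + eps)

mcb_output = np.array([-4.0, 0.0, 9.0, 16.0])  # toy stand-in for a 16k vector
z = l2_normalize(signed_sqrt(mcb_output))
print(z, np.linalg.norm(z))
```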
![Page 16: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/16.jpg)
Ablation: Comparison to other multimodal methods
Trained on train, test-dev Acc. [%]

| Method | test-dev Acc. [%] |
|---|---|
| MCB (128x128 -> 4k) | 58.7 |
| Full Bilinear (128x128 -> 16k) | 58.5 |
| MCB (2048x2048 -> 16k) | 59.8 |
| Eltwise Product + FC + FC | 57.8 |
| Eltwise Product + FC | 56.4 |
| Eltwise Product | 58.6 |
| Concat + FC + FC | 57.1 |
| Concat + FC | 58.4 |
| Concat | 57.5 |
| Eltwise Sum | 56.5 |

• MCB achieves highest accuracy
![Page 17: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/17.jpg)
Ablation: Comparison to other multimodal methods
[Chart: same ablation results as on the previous slide; MCB (128x128 -> 4k) 58.7 vs. Full Bilinear (128x128 -> 16k) 58.5, test-dev Acc. [%]]
• MCB comparable to Full Bilinear
![Page 18: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/18.jpg)
Dimensionality of MCB
• Dimensionality of MCB decides the performance of outer product approximation
[Diagram: two 2048-dimensional inputs → MCB → output of size "Dim size"]
VQA Open-Ended test-dev accuracy

| MCB dim size | test-dev Acc. [%] |
|---|---|
| 1024 | 58.4 |
| 2048 | 58.8 |
| 4096 | 59.4 |
| 8192 | 59.7 |
| 16000 | 59.8 |
| 32000 | 59.7 |
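One way to see why the output dimensionality matters is to measure how well the inner product of two MCB outputs approximates the inner product of the corresponding outer products, which equals ⟨x, x'⟩⟨y, y'⟩. The NumPy experiment below is a hedged illustration on random toy vectors; the function names, sizes, trial count, and seed are assumptions, and it says nothing about VQA accuracy itself.

```python
import numpy as np

def sketch_pair(x, y, hx, sx, hy, sy, d):
    """Count-sketch x and y into d buckets, then circularly convolve via FFT."""
    px = np.zeros(d)
    py = np.zeros(d)
    np.add.at(px, hx, sx * x)
    np.add.at(py, hy, sy * y)
    return np.fft.irfft(np.fft.rfft(px) * np.fft.rfft(py), n=d)

def mean_kernel_error(n, d, trials=50, seed=0):
    """Average |<MCB(x,y), MCB(x2,y2)> - <x,x2><y,y2>| over random trials."""
    rng = np.random.default_rng(seed)
    signs = np.array([-1.0, 1.0])
    errors = []
    for _ in range(trials):
        hx, hy = rng.integers(0, d, size=(2, n))    # bucket indices
        sx, sy = rng.choice(signs, size=(2, n))     # random signs
        x, y, x2, y2 = rng.standard_normal((4, n))  # two (image, question) pairs
        exact = np.dot(x, x2) * np.dot(y, y2)
        approx = np.dot(sketch_pair(x, y, hx, sx, hy, sy, d),
                        sketch_pair(x2, y2, hx, sx, hy, sy, d))
        errors.append(abs(approx - exact))
    return float(np.mean(errors))

for d in (64, 512, 4096):
    print(d, mean_kernel_error(n=32, d=d))
```

The approximation error should shrink as d grows, which matches the intuition behind the bullet above; the accuracy plateau in the table suggests that beyond some size the extra fidelity no longer helps the task.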
![Page 19: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/19.jpg)
Multimodal language and visual understanding
Visual Question Answering
What is the brown sauce?
Gravy
![Page 20: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/20.jpg)
MCB with Attention
• Predict spatial attentions with MCB
[Diagram: question "What is the yellow food?" → WE, LSTM → 2048, tiled to 2048 x 14 x 14; image → CNN (ResNet152) → 2048 x 14 x 14; MCB → 16k x 14 x 14 → Signed Sqrt → L2 Normalization → Conv, ReLU (512 x 14 x 14) → Conv → Softmax (1 x 14 x 14) → Weighted Sum (2048); attended image feature (2048) and question feature (2048) → MCB (16k) → Signed Sqrt → L2 Normalization (16k) → Fully Connected (3000) → Softmax → "Corn"]
Attention for captioning:
- K. Xu, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Attention for VQA:
- H. Xu, K. Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
- J. Lu, Hierarchical Question-Image Co-Attention for Visual Question Answering
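The last two attention steps in the diagram, a softmax over the 14 x 14 grid followed by a weighted sum of the image features, can be sketched as follows. This is an illustrative NumPy fragment only: the attention logits are taken as given (in the diagram they come from MCB and two conv layers), and the function names are assumptions.

```python
import numpy as np

def spatial_softmax(logits):
    """Softmax over all H x W locations of a single attention map."""
    z = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def attend(features, logits):
    """Weighted sum of C x H x W features with an H x W attention map, giving C values."""
    weights = spatial_softmax(logits)
    return np.tensordot(features, weights, axes=([1, 2], [0, 1]))

rng = np.random.default_rng(0)
features = rng.standard_normal((2048, 14, 14))  # image features from the CNN
logits = np.zeros((14, 14))                     # uniform attention as a sanity case
attended = attend(features, logits)
print(attended.shape)
```

With all-zero logits the attention is uniform, so the attended feature reduces to the spatial mean of the feature map; a peaked attention map would instead pick out the features at the attended locations.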
![Page 21: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/21.jpg)
Attention Visualizations
Is this person wearing a hat? Yes [Groundtruth: Yes]
![Page 22: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/22.jpg)
Results on MCB with Attention
• MCB performs well with Attention
Performance of Attention Models

| Model | test-dev Acc. [%] |
|---|---|
| Concat + FC | 58.4 |
| MCB | 59.8 |
| Concat + FC + Attention | 58.4 |
| MCB + Attention | 62.5 |
![Page 23: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/23.jpg)
Techniques to improve performance
• Data Augmentation
  - VQA data from Visual Genome Dataset
  - Additional 1M question and answer pairs
  - Removed articles, single word answer
• Ensembles
  - Average the output of Softmax over models
VQA Open-Ended accuracy for genome and ensemble

| Model (training data) | test-dev Acc. [%] |
|---|---|
| MCB + Attention (train) | 62.5 |
| MCB + Attention (train+val) | 64.2 |
| MCB + Attention (train+val+genome) | 65.1 |
| MCB + Attention + Ensemble (train+val+genome) | 66.7 |
| MCB + Attention + Ensemble (train+val+genome), Multiple Choice | 70.2 |

Visual genome: Connecting language and vision using crowdsourced dense image annotations.
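The ensembling step, averaging the softmax outputs over models, is small enough to sketch directly. The NumPy code below uses made-up logits for two hypothetical models and three answers; the function names are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of answer scores."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def ensemble_predict(logits_per_model):
    """Average the per-model softmax outputs and return (best answer index, probabilities)."""
    probs = np.mean([softmax(l) for l in logits_per_model], axis=0)
    return int(np.argmax(probs)), probs

# Two hypothetical models scoring three candidate answers.
logits_per_model = [np.array([2.0, 0.0, 0.0]),
                    np.array([0.0, 0.0, 3.0])]
best, probs = ensemble_predict(logits_per_model)
print(best, probs)
```

Averaging probabilities rather than raw scores keeps each model's contribution on the same scale, so a single model with unusually large logits cannot dominate the vote.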
![Page 24: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/24.jpg)
MCB on other Datasets and Tasks
• Visual7W (Multiple Choice)
[Figure: Our architecture for Visual7W: MCB with Attention and Answer Encoding.]

| Method | Accuracy on Visual7W |
|---|---|
| Zhu et al. | 54.3 |
| Concat + Attention | 52.8 |
| MCB + Attention | 62.2 |

• Visual Grounding

| Method | Accuracy on Flickr30k Entities |
|---|---|
| Plummer et al. | 43.8 |
| Wang et al. | 43.9 |
| Rohrbach et al. | 47.7 |
| Concat | 46.5 |
| Eltwise Prod | 47.4 |
| Eltwise Prod + Conv | 47.9 |
| MCB | 48.7 |

Visual7W: Grounded Question Answering in Images
Grounding of textual phrases in images by reconstruction.
![Page 25: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/25.jpg)
Examples for VQA
![Page 26: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/26.jpg)
What is the woman feeding the giraffe? Carrot [Groundtruth: Carrot]
Attention Visualizations
![Page 27: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/27.jpg)
What color is her shirt? Purple [Groundtruth: Purple]
Attention Visualizations
![Page 28: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/28.jpg)
What is her hairstyle for the picture? Ponytail [Groundtruth: Ponytail]
Attention Visualizations
![Page 29: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/29.jpg)
What color is the chain on the red dress? Pink [Groundtruth: Gold]
Attention Visualizations
• Correct Attention, Incorrect Fine-grained Recognition
![Page 30: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/30.jpg)
Is the man going to fall down? No [Groundtruth: No]
Attention Visualizations
![Page 31: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/31.jpg)
What is the surface of the court made of? Clay [Groundtruth: Clay]
Attention Visualizations
![Page 32: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/32.jpg)
What sport is being played? Tennis [Groundtruth: Tennis]
Attention Visualizations
![Page 33: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/33.jpg)
What does the shop sell? Clocks [Groundtruth: Hot Dogs]
Attention Visualizations
• Incorrect Attention
![Page 34: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/34.jpg)
What credit card company is on the banner in the background? Budweiser [Groundtruth: Mastercard]
Attention Visualizations
• Correct Attention, Incorrect Concept Association
![Page 35: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/35.jpg)
Conclusions
• Multimodal Compact Bilinear Pooling
  - All elements interact multiplicatively
  - Compact and efficient
• MCB with Attention
  - Successfully predicts spatial attention
• Generalization Capability
  - Performance improvement in other vision and language tasks
    - Visual7W, Visual Grounding
  - Compatible with other models
  - Applicable to general multimodal tasks, not only vision and language
![Page 36: Multimodal Compact Bilinear Pooling for VQA - VQA: …Multimodal Compact Bilinear Pooling for VQA Akira Fukui1,2, Dong Huk Park1, Daylen Yang1, Anna Rohrbach1,3, Trevor Darrell1, Marcus](https://reader030.fdocuments.in/reader030/viewer/2022040204/5ece286cf6bb9c0f49301b2d/html5/thumbnails/36.jpg)
Thank you for your attention!
Demo: demo.berkeleyvision.org
Code: https://github.com/akirafukui/vqa-mcb/
Multimodal Compact Bilinear Pooling for Visual Question Answering and Grounding
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach
arXiv 2016