Multimodal Residual Learning for Visual QA
NamHyuk Ahn
Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)
Visual QA: Evaluation Metric
- Robust to inter-human variability (metric sketched below)
- Human accuracy is almost 90%
- 248,349 Training questions (82,783 Images)
- 121,512 Validation questions (40,504 Images)
- 244,302 Testing questions (81,434 Images)
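As a concrete illustration of this evaluation, here is a minimal Python sketch of the accuracy formula stated in the VQA paper (Antol et al.): an answer scores min(#matching annotators / 3, 1), so agreeing with at least 3 of the 10 human annotators counts as fully correct. The example answers are made up for illustration.

```python
def vqa_accuracy(predicted, human_answers):
    # VQA open-ended metric: an answer is fully correct if at least
    # 3 of the 10 human annotators gave the same answer.
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Hypothetical annotator answers for one question.
human_answers = ["dogs", "dogs", "dogs", "dogs", "dog",
                 "puppies", "two dogs", "dogs", "dog", "animals"]
print(vqa_accuracy("dogs", human_answers))  # 1.0 (>= 3 matches)
print(vqa_accuracy("cats", human_answers))  # 0.0 (no matches)
```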
Stacked Attention Network
Motivation
- Answering a question requires multi-step reasoning
- With objects {bicycles, window, street, baskets, dogs}
- To answer the question well, pinpoint the relevant region.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- Extension of the attention mechanism, which has been successfully applied in captioning, translation, etc.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network
- Image Model
  • Extract image features using a CNN
- Question Model
  • Extract a semantic vector using a CNN or LSTM
- Stacked Attention
  • Multi-step reasoning with attention layers
Stacked Attention
- Multi-step reasoning using attention layers
Image / Question Model
- Image Model
  • Get a feature map from the raw image pixels
  • Rescale the image to 448x448 and take features from pool5 of VGGNet (14x14x512)
  • Additional layer to fit the question feature dimension (shape sketch below)
- Question Model
  • Encode the question into a semantic vector using a CNN or LSTM
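A shape-level sketch of the image model step, assuming numpy and using random arrays in place of the real VGGNet pool5 output and the learned projection; only the shapes follow the slide, and the embedding dimension of 1024 is an assumption.

```python
import numpy as np

np.random.seed(0)

# Stand-in for the pool5 feature map of a 448x448 image (14x14x512).
pool5 = np.random.randn(14, 14, 512)
regions = pool5.reshape(196, 512)          # 196 region vectors, one per 14x14 cell

d = 1024                                   # assumed common embedding dimension
W_I = np.random.randn(d, 512) * 0.01       # placeholder for the additional layer
b_I = np.zeros(d)
V_I = np.tanh(regions @ W_I.T + b_I)       # image features matched to the question dim
print(V_I.shape)                           # (196, 1024)
```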
Stacked Attention Model
- A global image feature is suboptimal due to noise from irrelevant objects / regions.
- Instead, use the stacked attention model to pinpoint the relevant region.
- Given the image feature matrix and the question vector, compute a 14x14 attention distribution over regions.
- Get the weighted sum of image vectors from each region.
- Add it to the question vector to form the refined query vector (numpy sketch below).
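A minimal numpy sketch of one such attention step (weights are random placeholders; shapes follow the stacked attention idea above): score each of the 196 regions against the question, softmax into an attention distribution, take the weighted sum, and add it to the query. When stacked, the refined query becomes the question input to the next layer.

```python
import numpy as np

def attention_step(V_I, v_Q, W_IA, W_QA, w_P):
    # Combine each region feature with the question vector and score it.
    h_A = np.tanh(V_I @ W_IA.T + v_Q @ W_QA.T)   # (196, k)
    scores = h_A @ w_P                           # (196,)
    p_I = np.exp(scores - scores.max())
    p_I /= p_I.sum()                             # attention over the 14x14 regions
    v_tilde = p_I @ V_I                          # weighted sum of region vectors
    return v_tilde + v_Q                         # refined query vector

np.random.seed(0)
d, k = 1024, 512                                 # assumed feature / hidden dims
V_I, v_Q = np.random.randn(196, d), np.random.randn(d)
u = attention_step(V_I, v_Q,
                   np.random.randn(k, d) * 0.01,
                   np.random.randn(k, d) * 0.01,
                   np.random.randn(k) * 0.01)
print(u.shape)                                   # (1024,)
```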
Result
Residual Learning
Problem of Degradation
- More depth gives more accuracy, but deep networks can suffer from vanishing/exploding gradients
  • BN, Xavier init, and Dropout can handle this (up to ~30 layers)
- With even more depth, the degradation problem occurs
  • Not only overfitting: training error also increases
Residual Network (ResNet)
Residual Block
- To avoid the degradation problem, add a shortcut connection.
- Element-wise addition of F(x) and the shortcut connection, then pass through ReLU (sketched below).
- Similar to LSTM
http://torch.ch/blog/2016/02/04/resnets.html
Shortcut connection
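A small numpy sketch of the residual block idea, using a fully-connected F(x) instead of the convolutional branch used in ResNet: the shortcut carries x unchanged, the residual branch computes F(x), and the two are added element-wise before the final ReLU.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, W1, W2):
    f = W2 @ relu(W1 @ x)      # residual branch F(x)
    return relu(x + f)         # element-wise add with the shortcut, then ReLU

np.random.seed(0)
d = 64
x = np.random.randn(d)
y = residual_block(x, np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1)
print(y.shape)                 # (64,) -- same shape as x, so blocks stack deeply
```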
Multimodal Residual Network
Introduction
- Extend deep residual learning for visual QA
- Achieve state-of-the-art results on the visual QA dataset (not covered today :( )
- Introduce a method to visualize the spatial attention effect of joint residual mappings
Background
SAN
- But question information contributes only weakly, which causes a bottleneck.
Baseline [Lu et al.]
- With just element-wise multiplication, visual and question features embed very well.
MRN
- Shortcut mappings and a stacking architecture
- No weighted sum over regions
- Instead, uses the global multiplication that [Lu et al.] does (see the sketch below).
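A numpy sketch of one MRN learning block under the formulation in Kim et al.: the joint residual function multiplies the tanh-embedded question and a two-layer tanh embedding of the visual feature element-wise (the global multiplication of [Lu et al.], with no weighted sum over regions), and the shortcut is a learned linear mapping of the question state rather than an identity. Weights are random placeholders and the dimensions are assumptions.

```python
import numpy as np

def mrn_block(q, v, W_q, W_v1, W_v2, W_shortcut):
    # Joint residual function: element-wise multiplication of the
    # question embedding and a two-layer visual embedding.
    f = np.tanh(W_q @ q) * np.tanh(W_v2 @ np.tanh(W_v1 @ v))
    # Shortcut is a linear mapping of the question state
    # (identity shortcuts degrade accuracy, cf. ablation (d)).
    return W_shortcut @ q + f

np.random.seed(0)
d_q, d_v, d = 2400, 2048, 1200      # assumed question, image, and joint dims
q, v = np.random.randn(d_q), np.random.randn(d_v)
h = mrn_block(q, v,
              np.random.randn(d, d_q) * 0.01,
              np.random.randn(d, d_v) * 0.01,
              np.random.randn(d, d) * 0.01,
              np.random.randn(d, d_q) * 0.01)
print(h.shape)                      # (1200,) -- question state for the next block
```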
Quantitative Analysis
- (a) shows a large improvement over SAN; (b) is better.
- (c) adding an extra embedding on the question side causes overfitting.
- (d) an identity shortcut causes degradation (an extra linear mapping is needed).
- (e) performs reasonably, but the extra shortcut is not essential.
Quantitative Analysis
# of Learning Blocks
- 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly better than VGGNet
- Even though ResNet features have a lower dimension (2048 vs. 4096)
# of Answer Classes
- There is a trade-off across answer types, but 2k classes works best
- Implicit attention through element-wise multiplication
- Gives a high-resolution attention map
Reference
- Yang, Zichao, et al. "Stacked Attention Networks for Image Question Answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "VQA: Visual Question Answering." Proceedings of the IEEE International Conference on Computer Vision, 2015.