Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering
Zhe Wang¹, Xiaoyi Liu², Liangjian Chen¹, Limin Wang⁴, Yu Qiao³, Xiaohui Xie¹, Charless Fowlkes¹
¹CS, UC Irvine; ²Microsoft; ³SIAT, CAS; ⁴CVL, ETH
Y. Zhu, O. Groth, M. Bernstein, Fei-Fei Li. “Visual7W: Grounded Question Answering in Images”, CVPR, 2016
Multiple Choice Visual Question Answering (VQA)
Our contributions
The (spatial) attention only depends on the image-question pair input.
- We consider the image-answer interactions when computing attention.
The sentence representation is limited: either LSTM encoding or a simple average of word vectors.
- We propose to integrate part-of-speech tags and convolutional n-gram processing to better encode query and answer sentences.
Image-question-answer triplets corresponding to the same image-question pair are treated independently.
- We introduce structured triplet learning and mine “hard negative” triplets to improve the system.
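Part-of-speech-tag (POS) guided attention
Question: Why was the hand of the woman over the left shoulder of the man?
POS tags: WRB V O N O O N O O J N O O N
[Figure: each question word is mapped to a GloVe word embedding and to a POS tag; the POS weight and the word embedding are combined by element-wise product to give a new word vector.]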
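The weighting step in this slide can be illustrated with a minimal sketch. It assumes one scalar weight per POS tag, applied to every component of the word's embedding; the slide only states that a POS weight and the word embedding are combined by element-wise product, so the shape of the weight is an assumption. The weight values, the 2-d embeddings, and the function name `pos_guided_vectors` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical per-tag weights; in a trained model these would be learned.
POS_WEIGHT: Dict[str, float] = {"WRB": 1.0, "V": 0.5, "N": 1.0, "J": 0.75, "O": 0.25}


def pos_guided_vectors(
    embeddings: List[List[float]],
    tags: List[str],
    pos_weight: Dict[str, float],
) -> List[List[float]]:
    """Scale each word embedding by the weight assigned to its POS tag."""
    if len(embeddings) != len(tags):
        raise ValueError("need exactly one POS tag per word embedding")
    return [
        [pos_weight[tag] * value for value in vector]
        for vector, tag in zip(embeddings, tags)
    ]


# First four words of the example question, with toy 2-d vectors
# standing in for GloVe embeddings.
words = ["why", "was", "the", "hand"]
tags = ["WRB", "V", "O", "N"]
embeddings = [[1.0, 2.0], [2.0, 4.0], [10.0, 20.0], [3.0, 5.0]]

new_vectors = pos_guided_vectors(embeddings, tags, POS_WEIGHT)
```

With these toy weights the function word "the" (tag O) is scaled down, while the noun "hand" passes through unchanged.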
Convolutional N-Gram
Question: Why was the hand of the woman over the left shoulder of the man?
POS tags: WRB V O N O O N O O J N O O N
Convolutional filtering of word vectors encodes local sentence context.
[Figure: the word vectors pass through Conv-1Gram, Conv-2Gram and Conv-3Gram filters to produce new word, phrase and sentence representations.]
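A rough sketch of n-gram convolution over word vectors follows. It uses a single filter per window size and zero padding at the end of the sentence so that each word position gets one output. The position-wise maximum across the three window sizes at the end is an assumption about how the outputs might be combined; the slide does not specify that step. All names and kernel values are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def conv_ngram(word_vectors: List[Vector], kernel: List[Vector]) -> List[float]:
    """Slide an n-gram kernel (n weight vectors) over the sentence.

    The sentence is zero-padded at the end, so the output has one
    response per word position.
    """
    n = len(kernel)
    dim = len(word_vectors[0])
    padded = word_vectors + [[0.0] * dim] * (n - 1)
    return [
        sum(dot(kernel[j], padded[t + j]) for j in range(n))
        for t in range(len(word_vectors))
    ]


# Toy 2-d word vectors for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

unigram = conv_ngram(sentence, [[1.0, 1.0]])
bigram = conv_ngram(sentence, [[1.0, 0.0], [0.0, 1.0]])
trigram = conv_ngram(sentence, [[0.5, 0.0], [0.5, 0.0], [0.5, 0.0]])

# Assumed combination step: keep the strongest n-gram response per position.
pooled = [max(responses) for responses in zip(unigram, bigram, trigram)]
```

A real model would use many filters per window size and usually a nonlinearity; this sketch keeps one filter each so the arithmetic stays easy to follow.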
Structured Triplet Learning
Score[Question, Image, Answer(i)]
t_i ∈ {0, 1}: ground-truth label for Answer(i)
Logistic loss
Structured loss
For a given question, the correct answer should score higher than incorrect (competing) answers by a specified margin
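The two losses named above can be sketched as follows. The exact formulas are not preserved in this transcript, so the code uses common textbook forms as an assumption: a per-triplet binary cross-entropy on the score for the logistic loss, and a hinge on the gap between the correct answer and each competing answer for the structured loss. The function names, the margin value, and the example scores are hypothetical.

```python
import math
from typing import List


def logistic_loss(scores: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy, treating each triplet independently."""
    total = 0.0
    for s, t in zip(scores, labels):
        # Numerically stable softplus(s) = log(1 + exp(s)).
        softplus = max(s, 0.0) + math.log1p(math.exp(-abs(s)))
        total += softplus - t * s
    return total / len(scores)


def structured_loss(scores: List[float], labels: List[int], margin: float = 1.0) -> float:
    """Hinge penalty for every wrong answer scoring within `margin` of the correct one."""
    correct = scores[labels.index(1)]
    return sum(
        max(0.0, margin + s - correct)
        for s, t in zip(scores, labels)
        if t == 0
    )


def hardest_negative(scores: List[float], labels: List[int]) -> int:
    """Index of the highest-scoring wrong answer (a 'hard negative')."""
    wrong = [i for i, t in enumerate(labels) if t == 0]
    return max(wrong, key=lambda i: scores[i])


# Hypothetical scores for the candidate answers of one image-question pair.
scores = [2.0, 1.5, -1.0, 0.0]
labels = [1, 0, 0, 0]

loss = structured_loss(scores, labels, margin=1.0)
hard_index = hardest_negative(scores, labels)
```

Here only the second candidate violates the margin, so it is both the sole contributor to the structured loss and the hard negative that mining would select.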
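Visualizing Attention Maps
[Figure: input image alongside the attention maps computed for the correct answer and for the wrong answer.]
Question: What is the ceiling covered with? Candidate answers: Paper umbrellas / Tiles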
Question Word Attention
[Figure: input image with question word attention shown for the correct answer and for the wrong answer.]
Question: What sits on the police motorcycle? Candidate answers: A helmet / A pair of gloves