Seq2Seq Models and Attention Mechanism
CMPUT 651 (Fall 2019)
Seq2Seq
• Question:
- Why do we feed back the generated words?
- How can we train Seq2Seq models?
- How do we do inference?
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." In NIPS, 2014.
Why do we feed back the generated words?
• First thought: Feeding back is unnecessary
- y2 is predicted from h2
- h2 depends on h1
- y1 depends on h1
⇒ y1 brings no information
Multi-Modal Distribution
• Continuous distribution
• Discrete distribution
- Image in some embedding space: the sentences are multi-modally distributed
- "A beats B" vs. "B is beaten by A"
Training Seq2Seq Models
• Decoder's input layer
- Attempt #1: Feed in the predicted words
- Attempt #2: Feed in the groundtruth word
- Scheduled sampling
• Loss J
- Suppose we know the "groundtruth" target sequence
- Recall backpropagation with multiple losses (a training sketch follows after the loss equation below)
J = J1 + J2 + ⋯ + JT
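A minimal sketch of one such training step with teacher forcing (Attempt #2), assuming PyTorch; the module names, sizes, and the BOS-at-index-0 convention are illustrative assumptions, not the course's reference code.

```python
# Teacher-forcing training sketch: at each decoder step, feed the ground-truth previous
# word and sum the per-step cross-entropy losses, J = J_1 + J_2 + ... + J_T.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim, hid_dim = 1000, 64, 128   # assumed sizes
embed = nn.Embedding(vocab_size, emb_dim)
dec_cell = nn.GRUCell(emb_dim, hid_dim)
out_proj = nn.Linear(hid_dim, vocab_size)

def decoder_loss(enc_final_state, tgt):
    """enc_final_state: (batch, hid_dim) from the encoder; tgt: (T, batch) groundtruth word ids."""
    h = enc_final_state
    prev = torch.zeros_like(tgt[0])                          # assume index 0 is the BOS token
    loss = 0.0
    for t in range(tgt.size(0)):                             # one decoding step per target word
        h = dec_cell(embed(prev), h)                         # h_t = RNN(h_{t-1}, x_t)
        loss = loss + F.cross_entropy(out_proj(h), tgt[t])   # J_t, summed over steps
        prev = tgt[t]                                        # teacher forcing: feed the groundtruth word
    return loss                                              # J = J_1 + ... + J_T
```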
Inference
• Decoder's input layer
- Feed in the predicted words
• When do we terminate?
- Include a special token "EOS" (end of sequence) in training
- If "EOS" is predicted, the sentence is terminated by definition
Caveat
• Batch implementation
- Padding with EOS or the 0 vector => Incorrect
- Masking => Correct (a minimal sketch follows below)
h̃_t = RNN(h_{t−1}, x_t)
h_t = (1 − m) h_{t−1} + m h̃_t
• Implementation should always be equivalent to math derivations
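A minimal masking sketch under the assumption that m_t is 1 for real tokens and 0 for padded positions (PyTorch; shapes are illustrative):

```python
# Masking sketch: a finished (padded) position simply carries its hidden state forward,
# which is equivalent to running each sequence in the batch to its own length.
import torch
import torch.nn as nn

cell = nn.GRUCell(64, 128)   # assumed embedding and hidden sizes

def masked_step(h_prev, x_t, m_t):
    """h_prev: (batch, hid), x_t: (batch, emb), m_t: (batch,) float, 1.0 = real token, 0.0 = padding."""
    h_tilde = cell(x_t, h_prev)              # h̃_t = RNN(h_{t-1}, x_t)
    m = m_t.unsqueeze(1)                     # broadcast the mask over the hidden dimension
    return (1 - m) * h_prev + m * h_tilde    # h_t = (1 - m) h_{t-1} + m h̃_t
```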
Inference Criteria
• Single-output classification
- Max a posteriori inference ⇔ Minimal empirical loss
- y = argmax_y p(y | x)
• Sentence generation
- If we want to generate the "best" sentence: y = argmax_y p(y | x)
- Is greedy decoding, y_i = argmax_{y_i} p(y_i | y_{<i}, x), correct? (a sketch follows below)
- The cost of exhaustive search?
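A greedy-decoding sketch reusing the (assumed) `embed`, `dec_cell`, and `out_proj` modules from the training sketch above; picking y_i = argmax p(y_i | y_<i, x) at every step does not in general yield y = argmax p(y | x), which motivates beam search.

```python
# Greedy decoding sketch for a single sentence (BOS/EOS ids are assumptions).
import torch

def greedy_decode(enc_final_state, max_len=50, bos_id=0, eos_id=1):
    h = enc_final_state                       # (1, hid_dim)
    prev = torch.tensor([bos_id])
    out = []
    for _ in range(max_len):
        h = dec_cell(embed(prev), h)          # feed back the previously predicted word
        word = out_proj(h).argmax(dim=-1)     # locally best word at this step
        if word.item() == eos_id:             # EOS terminates the sentence by definition
            break
        out.append(word.item())
        prev = word
    return out
```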
Greedy vs. Exhaustive Search

|               | Greedy             | Beam Search                    | Exhaustive search         |
|---------------|--------------------|--------------------------------|---------------------------|
| For each step | Pick the best word | Try a few best words           | Try every word            |
| Maintain      | One sequence       | Several good partial sequences | All possible combinations |
Beam Search
(Figure: beam search decoding illustrated step by step with beam size B = 2)
Beam Search
• A list of best partial sequences S = [ ]
• For every decoding step t
- For every partial sequence s ∈ S and every word w ∈ 𝒱
  • Expand s as (s, w)
- S = the top-B expanded subsequences among all (s, w)
• Return the most probable sequence in the beam (once there exists a terminated sequence better than every s ∈ S); a code sketch follows below
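A compact beam search sketch; `step_log_probs` is an assumed scoring hook that returns log p(word | prefix, x) for every word in the vocabulary, and B is the beam size.

```python
# Beam search sketch: keep the B best partial sequences at every step, instead of
# one (greedy) or all possible combinations (exhaustive search).
import heapq

def beam_search(step_log_probs, B=2, eos="EOS", max_len=20):
    """step_log_probs(prefix) -> dict {word: log p(word | prefix, x)} (assumed interface)."""
    beam = [(0.0, [])]                                  # (log-probability, partial sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beam:
            for w, lp in step_log_probs(tuple(seq)).items():
                candidates.append((score + lp, seq + [w]))
        beam = heapq.nlargest(B, candidates, key=lambda c: c[0])   # top-B expansions
        finished += [c for c in beam if c[1][-1] == eos]           # move terminated hypotheses out
        beam = [c for c in beam if c[1][-1] != eos]
        # stop once a terminated sequence beats every remaining partial one
        if finished and (not beam or max(f[0] for f in finished) >= beam[0][0]):
            break
    return max(finished + beam, key=lambda c: c[0])[1]
```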
Issues with Autoregressiveness
• Error accumulation
- 1st word good, 2nd worse, 3rd even worse, etc.
• Label bias
- Not the "label imbalance" problem
- Beam search is biased towards highly probable words at the beginning
- Locally normalized models prefer highly probable (but possibly unimportant) words
Information Bottleneck
• Last hidden state
- Has to capture all source information
• Average/mean pooling
- Still loses information
- Not directly related to the currently decoded word
Attention Mechanism
• Dynamically aligning words
- Average pooling => weighted pooling
- The alignment depends on the current word to be generated
- The alignment to a particular source word obviously also depends on that source word itself
Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
Convex vs. Linear Weighting
• Average pooling
c = (1/N) h_1 + ⋯ + (1/N) h_N
• Weighted pooling
c = α_1 h_1 + ⋯ + α_N h_N
What are α_1, ⋯, α_N?
Computing Attention
• Score (negative energy): s_j^(t) = s(h_t^(tar), h_j^(src))
• Unnormalized measure: α̃_j^(t) = exp(s_j^(t))
• Probability: α_j^(t) = exp(s_j^(t)) / Σ_{j′} exp(s_{j′}^(t))
- Denominator: partition function
• Choices of the score function
- Inner product: s_j^(t) = (h_t^(tar))^⊤ h_j^(src)
- Metric learning: s_j^(t) = (h_t^(tar))^⊤ W h_j^(src)
- Neural layer: s_j^(t) = u^⊤ f(W [h_t^(tar); h_j^(src)])
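A sketch of inner-product attention followed by weighted pooling (PyTorch; tensor names and shapes are assumptions):

```python
# Attention sketch: the context vector is a convex combination of source states,
# with weights given by a softmax over scores between the current target state
# and every source state.
import torch
import torch.nn.functional as F

def attention(h_tar_t, h_src):
    """h_tar_t: (batch, d) current decoder state; h_src: (batch, N, d) encoder states."""
    scores = torch.bmm(h_src, h_tar_t.unsqueeze(2)).squeeze(2)   # s_j = h_tar^T h_src_j, (batch, N)
    alpha = F.softmax(scores, dim=-1)                            # attention probabilities
    context = torch.bmm(alpha.unsqueeze(1), h_src).squeeze(1)    # c = sum_j alpha_j h_src_j, (batch, d)
    return context, alpha
```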
Where are we?
• Picking the best (how do you know?)
• Average pooling
• Max pooling
• Attention: in the convex hull
• Linear combination: anywhere in the space
This is not too wrong. "Meaning is use" (Wittgenstein).
In machine learning, how you train is how you predict.
Attention as Soft Alignment
Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
Traditional Phrase-Based MT
Explicitly modeling alignment
IBM Models 1—5: A spectrum of simplifications
Brown, P.F., Pietra, V.J.D., Pietra, S.A.D. and Mercer, R.L., 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp.263-311.
Attention is all you need
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. Attention is all you need. In NIPS, 2017.
• Information processed by multi-head attention
• Sinusoidal position embedding
- BERT uses learned position embedding
Transformer (horrible terminology)
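A sketch of the sinusoidal position embedding, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)), assuming an even model dimension:

```python
# Sinusoidal position embedding sketch (shapes are illustrative; d_model assumed even).
import torch

def sinusoidal_positions(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)     # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)              # even dimensions 0, 2, 4, ...
    freq = pos / (10000 ** (i / d_model))                           # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(freq)
    pe[:, 1::2] = torch.cos(freq)
    return pe                                                       # added to the word embeddings
```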
Attention beyond MT
• Attention probability is essentially a softmax
- with a varying number of target classes
• Attention content aggregates information by a weighted sum
- especially useful when the number of entities may change
More Applications
• Dialogue systems
• Summarization
• Paraphrase generation
• etc.
The encoder does not have to be a sequence model:
• Table-to-text generation
• Graph-to-text generation
Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, Zhifang Sui. Order-planning neural text generation from structured data. In AAAI, 2018.
Memory-Based Network
• Question answering (synthetic dataset)
Weston, Jason, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR, 2015.
Sukhbaatar, S., Weston, J. and Fergus, R., 2015. End-to-end memory networks. In NIPS, 2015.
Memory-Based Network
• Basically a multi-layer attention network
Weston, Jason, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR, 2015.
Sukhbaatar, S., Weston, J. and Fergus, R., 2015. End-to-end memory networks. In NIPS, 2015.
Neural Turing Machine

| Chomsky hierarchy | Grammar | Language | Automata | Neural analog |
|---|---|---|---|---|
| Type-3 | A → aB \| a | Regular expression | Finite state machine | RNN |
| Type-2 | A → a | Context-free | ND pushdown automata | |
| Type-1 | αAβ → αγβ | Context-sensitive | Esoteric | |
| Type-0 | αAβ → γ | Recursively enumerable | Turing machine | NTM |
A Rough Thought on RNN Computational Power
• If states are discrete, an RNN is an FSM
• # distinct states ∝ exp(# units)
• However, they are not free states
- Transitions are subject to a parametric function
• Unknown (at least to me) how real-valued states add to computational power
• At least, using the denseness of real numbers to express potentially infinite steps of recursion is inefficient
Neural Turing Machine
• Augment an RNN with a memory pad
• Read & write by memory addressing
• Attention-based memory addressing
- Content-based addressing (a sketch follows below)
- Allocation-based addressing
- Link-based addressing
Graves, A., et al., 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), p.471.
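A sketch of content-based addressing as cosine-similarity attention over memory rows, in the spirit of Graves et al.; the sharpening parameter β and all shapes are illustrative assumptions, not the paper's exact code.

```python
# Content-based addressing sketch: attend to memory slots whose content matches a query key.
import torch
import torch.nn.functional as F

def content_addressing(memory, key, beta=1.0):
    """memory: (slots, width), key: (width,), beta: sharpening scalar."""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)   # similarity per memory slot
    return F.softmax(beta * sim, dim=-1)                          # attention weights over slots
```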
Neural Turing Machine
• Content-based memory addressing
• Temporal linkage-based memory addressing
• Dynamic memory allocation
Problems
• Memory addressing is purely hypothetical
• May not learn true "programs"
• Thoughts for future work
- Learn from a restricted class of automata (e.g., PDA)
- Make intermediate execution results a Markov blanket
Memory for Domain Adaptation
Asghar, N., Mou, L., Selby, K.A., Pantasdo, K.D., Poupart, P. and Jiang, X., 2018. Progressive Memory Banks for Incremental Domain Adaptation. arXiv preprint arXiv:1811.00239.
Conclusion & Take-Home Message
• Sequence-to-sequence training
- Training: Step-by-step supervised learning
- Inference: Greedy, Beam search, sampling
• Attention
- Adaptive weighted sum of source information
- Alignment in MT
- Aggregated information
Suggested Reading
• Automata theory
• Scheduled sampling. Bengio, Samy, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. "Scheduled sampling for sequence prediction with recurrent neural networks." In Advances in Neural Information Processing Systems, pp. 1171-1179. 2015.
• Phrase-based MT. Brown, P.F., Pietra, V.J.D., Pietra, S.A.D. and Mercer, R.L., 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp.263-311.
• NTM. Graves, A., et al., 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), p.471.
More References
• Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." In NIPS, 2014.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. Attention is all you need. In NIPS, 2017.
• Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, Zhifang Sui. Order-planning neural text generation from structured data. In AAAI, 2018.
• Weston, Jason, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR, 2015.
• Sukhbaatar, S., Weston, J. and Fergus, R., 2015. End-to-end memory networks. In NIPS, 2015.
• Asghar, N., Mou, L., Selby, K.A., Pantasdo, K.D., Poupart, P. and Jiang, X., 2018. Progressive Memory Banks for Incremental Domain Adaptation. arXiv preprint arXiv:1811.00239.