Transcript of Sequence to Sequence Models - Indian Institute of ...
Sequence to Sequence Models
Mausam
(Slides by Yoav Goldberg, Graham Neubig, Prabhakar Raghavan)
Neural Architectures
• Mapping from a sequence to a single decision,
• with a CNN or BiLSTM acceptor.
• Mapping from two sequences to a single decision,
• with a Siamese network.
• Mapping from a sequence to a sequence of the same length,
• with a BiLSTM transducer.
what do we do if the input and output
sequences are of different lengths?
we already have an architecture for
0-to-n mapping.
(sequence generation)
RNN Language Models
• Training: an RNN transducer.
• Generation: the output of step i is the input to step i+1.
RNN Language Model for generation
• Define the probability distribution over the next item in a sequence (and hence the probability of a sequence).
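The chain-rule decomposition this slide describes can be sketched in a few lines. A minimal sketch, with a made-up bigram table standing in for the conditionals a real RNN would predict:

```python
# Sketch of the chain rule an RNN LM implements:
#   P(t1..tn) = prod_i p(t_i | t_1..t_{i-1})
# A hypothetical bigram table stands in for the RNN's predictions.
from math import prod

cond_prob = {  # made-up p(next | prev) values, for illustration only
    ("<s>", "the"): 0.5, ("the", "cat"): 0.4, ("cat", "</s>"): 0.6,
}

def sequence_prob(tokens):
    """Probability of a full sequence via the chain rule."""
    padded = ["<s>"] + tokens + ["</s>"]
    return prod(cond_prob.get(pair, 1e-6) for pair in zip(padded, padded[1:]))

print(sequence_prob(["the", "cat"]))  # 0.5 * 0.4 * 0.6 = 0.12
```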
RNN Language Models
RNN Language Models
Generating sentences is nice, but what if we want
to add some additional conditioning contexts?
Conditioned Language Model
• Not just generate text, generate text according to some specification
RNN Language Model
for Conditioned generation
Let's add the condition variable to the equation.
P( ti | t1,…,ti-1 )      →  P( T )
P( ti | c, t1,…,ti-1 )   →  P( T | c )      (c is a vector)
RNN Language Model
for Conditioned generation
what if we want to condition on an entire sentence?
just encode it as a vector...
A simple Sequence to Sequence conditioned generation
How to Pass Hidden State
Sequence to Sequence
conditioned generation
This is also called the "Encoder-Decoder" architecture.
[Diagram: Encoder → Decoder]
The decoder is just a conditioned language model.
Sequence to Sequence
training graph
The Generation Problem
We have a probability model, how do we use it to generate a sentence?
Two methods:
• Sampling: Try to generate a random sentence according to the probability distribution.
• Argmax: Try to generate the sentence with the highest probability.
Ancestral Sampling
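Ancestral sampling draws each token from p(· | history) and feeds the sample back in as the next input. A minimal sketch, where the `step` function is a hypothetical stand-in for the softmax distribution a real model would compute from its state:

```python
import random

def step(history):
    # hypothetical next-token distribution; a real model would compute
    # a softmax over the vocabulary from the RNN state
    if not history:
        return {"hello": 0.7, "bye": 0.3}
    return {"world": 0.6, "</s>": 0.4}

def ancestral_sample(max_len=10, seed=0):
    rng = random.Random(seed)
    tokens = []
    while len(tokens) < max_len:
        dist = step(tokens)
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "</s>":          # end-of-sequence symbol
            break
        tokens.append(tok)
    return tokens

print(ancestral_sample())
```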
Greedy Search
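Greedy search replaces the sample with the argmax at every step. A sketch, with a hypothetical `toy_model` standing in for the model's next-token distribution:

```python
# Greedy search sketch: take the argmax token at each step instead of
# sampling. Deterministic, but can miss globally better sequences.
def greedy_decode(step, max_len=10):
    tokens = []
    while len(tokens) < max_len:
        dist = step(tokens)                 # p(. | history) from the model
        tok = max(dist, key=dist.get)       # argmax instead of a sample
        if tok == "</s>":
            break
        tokens.append(tok)
    return tokens

# hypothetical stand-in for the model's next-token distribution
toy_model = lambda h: {"hello": 0.7, "bye": 0.3} if not h else {"</s>": 0.6, "world": 0.4}
print(greedy_decode(toy_model))  # ['hello']
```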
Beam Search
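Beam search keeps the k best-scoring partial hypotheses at each step instead of committing to one. A minimal sketch (log-probability scores; `toy_model` is a hypothetical stand-in for the model):

```python
# Beam search sketch: keep the k best partial hypotheses and expand
# each of them at every step; finished hypotheses are carried along.
import math

def beam_search(step, k=2, max_len=5):
    beams = [([], 0.0)]                          # (tokens, log-prob)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks and toks[-1] == "</s>":      # finished hypothesis
                candidates.append((toks, score))
                continue
            for tok, p in step(toks).items():
                candidates.append((toks + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]

# hypothetical stand-in for the model's next-token distribution
toy_model = lambda h: {"hello": 0.7, "bye": 0.3} if not h else {"</s>": 0.6, "world": 0.4}
print(beam_search(toy_model))  # ['hello', '</s>']
```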
How to evaluate?
• Basic paradigm:
• Use a parallel test set
• Use the system to generate translations
• Compare the generated translations with the reference translations
Human Evaluation
BLEU
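BLEU's core idea is modified n-gram precision combined with a brevity penalty. Real BLEU is corpus-level, uses up to 4-grams, and is usually smoothed; the sentence-level toy below (single reference, up to bigrams) is illustrative only:

```python
# Toy sentence-level BLEU: geometric mean of modified n-gram
# precisions, times a brevity penalty for short hypotheses.
import math
from collections import Counter

def bleu(hyp, ref, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        precisions.append(overlap / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat".split(), "the cat sat".split()))  # 1.0
```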
METEOR
• Like BLEU in overall principle, with many other tricks: consider paraphrases, reordering, and function word/content word difference
• Pros: Generally significantly better than BLEU, esp. for high-resource languages
• Cons: Requires extra resources for new languages (although these can be made automatically), and more complicated
Perplexity
• Calculate the perplexity of the words in the held-out set without doing generation
• Pros: Naturally solves multiple-reference problem!
• Cons: Doesn’t consider decoding or actually generating output. May be reasonable for problems with lots of ambiguity.
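Perplexity as described above is just the exponentiated average negative log-likelihood of the held-out tokens, with no generation involved. A minimal sketch:

```python
# Perplexity sketch: exp of the mean negative log-likelihood the model
# assigns to each held-out token.
import math

def perplexity(token_probs):
    """token_probs: p(t_i | history) for each held-out token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -- uniform over 4 tokens
```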
Case Study: Smart Reply in Gmail
Preprocessing an incoming email
● Language detection
○ Currently handle English, Portuguese, Spanish … a few more languages are in preparation
● Tokenization of subject and message body
● Sentence segmentation
● Normalization of infrequent words and entities – replaced by special tokens
● Removal of quoted and forwarded email portions
● Removal of greeting and closing phrases (“Hi John”, … “Regards, Mary”)
LSTM translation
Vinyals & Le, 2015
Pick the best suggestions
(LSTM)
Is it worth it?
● Precision/accuracy - how well can we guess good replies?
○ Self-reinforcing behavior - often machine predictions are “good enough”
○ Machines learn from humans, and vice versa
● Coverage - do most emails have simple, predictable responses?
○ Do a small number of utterances cover a large fraction of responses?
○ Language/cultural variations? Linguistic entropy
Metric
● What fraction of the time do users select a suggested reply?
○ How many replies do we suggest? 3
○ Constraint based on the user interface, but also on users’ ability to quickly process choices
● We get a boost from allowing users to edit responses before sending
○ In early studies, users were nervous that choosing a response would instantly send
○ Careful tuning of this UI gave us bigger gains than a lot of ML tuning
Some early observations
A scoring algorithm doesn’t make a product
● Semantic variation: doesn’t help if all three suggestions say the same thing …
○ Can’t simply take the 3 highest-scoring suggestions
● The “I love you” problem
○ Some responses are unhelpful: a human can say them, but not a computer …*
○ A lot of responses in the training corpus have “I love you”
○ In many cases this isn’t appropriate
○ “Family friendliness”
● Sensitivity
○ There are many incoming emails where you don’t want the computer to guess replies - bad news, etc.
● * In general our expectations of “working” AI are higher than of humans
>10% of Gmail responses are Smart Replies.
(Users accept computer-generated replies.)
Encoder-Decoder with different modalities
The encoded conditioning context need not
be text, or even a sequence.
Encoder-Decoder with different modalities
• Encode: image to vector. Decode: a sentence describing the image.
• This sort of works. In my opinion, it looks more impressive than it really is.
Sentence Representation
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"
Sequence to Sequence
conditioned generation
[Diagram: Encoder → Decoder]
main idea: encoding to a single vector is too restrictive.
Attention
• Instead of the encoder producing a single vector for the sentence, it will produce one vector for each word.
Sequence to Sequence
conditioned generation
[Diagram: Encoder → Decoder]
but how do we feed this sequence to the decoder?
Sequence to Sequence
conditioned generation
Encoder
we can combine the different outputs
into a single vector (attended summary)
a different single vector
at each decoder step.
encoder-decoder with attention
encoder-decoder with attention
• The encoder encodes the input as a sequence of vectors, c1,...,cn.
• At each decoding stage, an MLP assigns a relevance score to each encoder vector.
• The relevance score is based on ci and the decoder state sj.
• A weighted sum (based on relevance) is used to produce the conditioning context for decoder step j.
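The steps above can be sketched in a few lines. In this sketch a plain dot product stands in for the MLP scorer the slide describes:

```python
# Attention sketch: score each encoder vector c_i against the decoder
# state s_j, softmax the scores into weights, and return the weighted
# sum as the conditioning context for this decoder step.
import math

def attend(encoder_vecs, state):
    scores = [sum(c * s for c, s in zip(ci, state)) for ci in encoder_vecs]
    m = max(scores)                               # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]       # softmax
    dim = len(encoder_vecs[0])
    context = [sum(w * ci[d] for w, ci in zip(weights, encoder_vecs))
               for d in range(dim)]
    return context, weights
```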
encoder-decoder with attention
• Decoder "pays attention" to different parts of the encoded sequence at each stage.
• The attention mechanism is "soft" -- it is a mixture of encoder states.
• The encoder acts as a read-only memory for the decoder
• The decoder chooses what to read at each stage
Attention
• Attention is very effective for sequence-to-sequence tasks.
• Current state-of-the-art systems all use attention. (This is basically how Machine Translation works.)
• Attention also makes models somewhat more interpretable: we can see where the model is "looking" at each stage of the prediction process.
Attention
Complexity (n = input length, m = output length)
• Encoder-decoder: O(n+m)
• Encoder-decoder with attention: O(nm)
Beyond Seq2Seq
• Can think of a general design pattern in neural nets:
– Input: sequence, query
– Encode the input into a sequence of vectors
– Attend to the encoded vectors, based on query (weighted sum, determined by query)
– Predict based on the attended vector
Attention Functions
(v: attended vector, q: query vector)
• Additive Attention: MLPatt(q; v) = u·g(W1v + W2q)
• Dot Product: MLPatt(q; v) = q·v
• Multiplicative Attention: MLPatt(q; v) = q·Wv
Additive vs Multiplicative
dk is the dimensionality of q and v
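The scoring functions above can be sketched on toy vectors; the scaled variant divides by sqrt(dk) so dot products don't grow with dimensionality. W1, W2, and u in the additive form are caller-supplied here (learned in a real model):

```python
# Toy versions of the attention scoring functions.
import math

def dot_score(q, v):
    return sum(a * b for a, b in zip(q, v))

def scaled_dot_score(q, v):
    return dot_score(q, v) / math.sqrt(len(q))      # divide by sqrt(d_k)

def additive_score(q, v, W1, W2, u):
    # u . tanh(W1 v + W2 q)
    hidden = [math.tanh(sum(w1 * vj for w1, vj in zip(W1[i], v)) +
                        sum(w2 * qj for w2, qj in zip(W2[i], q)))
              for i in range(len(u))]
    return dot_score(u, hidden)
```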
Key-Value Attention
• Split v into two vectors v = [vk; vv]
– vk: key vector
– vv: value vector
• Use the key vector for computing attention:
MLPatt(q; v) = u·g(W1vk + W2q)   // additive
• Use the value vector for computing the attended summary:
summary = Σi αi (vv)i
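A minimal sketch of the split described above: the key half of each vector is scored against the query, and the value halves form the summary. Dot-product scoring stands in here for the additive MLP:

```python
# Key-value attention sketch: each vector v = [vk; vv] is split into a
# key half (scored against the query) and a value half (summarized).
import math

def kv_attend(vecs, query):
    half = len(vecs[0]) // 2
    keys = [v[:half] for v in vecs]     # vk
    values = [v[half:] for v in vecs]   # vv
    scores = [sum(k * q for k, q in zip(key, query)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * val[d] for w, val in zip(weights, values))
            for d in range(half)]
```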
Multi-head Key-Value Attention
• For each head
– learn different projection matrices Wq, Wk, Wv
• MLPatt(q; v) = [(vkWk)·(qWq)] / sqrt(dk)
• For the summary, use vvWv (instead of vv)
• Train many such heads and
– use aggr(all such attended summaries), e.g., concatenation
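The bullets above can be sketched as follows. The projection matrices are caller-supplied here (learned in a real model), and concatenation is used as one common choice of aggregation:

```python
# Multi-head attention sketch: each head projects the query, keys, and
# values with its own matrices, attends with scaled dot-product
# scoring, and the per-head summaries are concatenated.
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def head(q, keys, values, Wq, Wk, Wv):
    qp = matvec(Wq, q)
    kp = [matvec(Wk, k) for k in keys]
    vp = [matvec(Wv, v) for v in values]
    d = len(qp)
    scores = [sum(a * b for a, b in zip(qp, k)) / math.sqrt(d) for k in kp]
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    w = [x / sum(e) for x in e]
    return [sum(wi * v[j] for wi, v in zip(w, vp)) for j in range(len(vp[0]))]

def multi_head(q, keys, values, heads):
    out = []
    for Wq, Wk, Wv in heads:        # aggregate by concatenation
        out += head(q, keys, values, Wq, Wk, Wv)
    return out
```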
Hard Attention
Self-attention/Intra-attention
Recall the attended Enc-dec
Self attention with LSTM
• c (in the previous slide) = h (in this slide)
• h: hidden state; x: input; h~: attended summary
• (Attended) hidden state / cell state (equations on slide)
• Rest of the LSTM (equations on slide)
Do we “need” an LSTM?
• Objective
– RNNs are slow; they can’t be parallelized
– Reduce sequential computation
• Self-attention encoder (Transformer)
– creatively combines layers of attention
– with other bells and whistles
• Self-attention decoder!!
Summary
• RNNs are very capable learners of sequential data.
• n -> 1: (bi)RNN acceptor
• n -> n : biRNN (transducer)
• 1 -> m : conditioned generation (conditioned LM)
• n -> m : conditioned generation (encoder-decoder)
• n -> m : encoder-decoder with attention