Adversarial learning for neural dialogue generation
Adversarial Learning for Neural Dialogue Generation
Presenter: Keon Kim
Original Paper by: Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter and Dan Jurafsky
TODOs
● What is this about?
● Result First!
● Why and How on Text Data?
● Adversarial Learning?
● The Model Breakdown
  - Generative
  - Discriminative
● Training Methods
  - Monte Carlo Policy Gradient (REINFORCE)
  - Reward for Every Generation Step (REGS)
● Teacher Forcing
● Notes
What Is This About?
- Adversarial Training for open-domain dialogue generation
“to train to produce sequences that are indistinguishable from human-generated dialogue utterances.”
Result First!
The adversarially trained system generates higher-quality responses than previous baselines!
Adversarial Training
A minimax game between the generator and the discriminator
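For reference, this is the minimax objective of a standard GAN (Goodfellow et al., 2014), which this game instantiates:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log(1 - D(G(z)))\right]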
Why and How on Text Data?
- Analogous to the Turing Test (just with a discriminator instead of a human)
- GANs have enjoyed great success in computer vision
- But they are hard to apply to NLP, because the text space is discrete and too discontinuous
- Small updates to the generator generally don't change the reinforcement feedback
- Progress has been made, and this work is one example

Given a dialogue history x consisting of a sequence of dialogue utterances, the model needs to generate a response y. We view the process of sentence generation as a sequence of actions that are taken according to a policy defined by an encoder-decoder recurrent neural network.
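A minimal sketch of this action-by-action view, with hypothetical `encoder` and `decoder_step` callables standing in for the encoder-decoder network:

```python
import torch

def generate(encoder, decoder_step, x, bos_id, eos_id, max_len=40):
    state = encoder(x)                             # encode the dialogue history
    y, token = [], bos_id
    for _ in range(max_len):
        probs, state = decoder_step(token, state)  # policy pi(y_t | x, y_<t)
        token = torch.multinomial(probs, 1).item() # each token is one sampled action
        if token == eos_id:
            break
        y.append(token)
    return y
```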
Model Breakdown

The model has two main parts, G and D:

Generative Model (G)
- Generates a response y given dialogue history x.
- Standard Seq2Seq model with an attention mechanism.

Discriminative Model (D)
- Binary classifier that takes a sequence of dialogue utterances {x, y} as input and outputs a label indicating whether the input was generated by a human or by the machine.
- Hierarchical encoder + 2-class softmax function -> returns the probability that the input dialogue episode is machine-generated or human-generated.
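A minimal sketch of such a discriminator in PyTorch (layer sizes and the utterance-batching convention are arbitrary choices for illustration, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Hierarchical encoder + 2-class softmax over {human, machine}."""
    def __init__(self, vocab_size, emb=128, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.utt_rnn = nn.LSTM(emb, hid, batch_first=True)  # encodes each utterance
        self.ctx_rnn = nn.LSTM(hid, hid, batch_first=True)  # encodes the utterance sequence
        self.out = nn.Linear(hid, 2)                        # 2-class softmax head

    def forward(self, utterances):
        # utterances: list of (1, T_i) token-id tensors for the episode {x, y}
        vecs = [self.utt_rnn(self.embed(u))[1][0][-1] for u in utterances]
        ctx, _ = self.ctx_rnn(torch.stack(vecs, dim=1))     # (1, num_utt, hid)
        return self.out(ctx[:, -1]).softmax(dim=-1)         # (1, 2) probabilities
```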
Training Methods (Important Part)

Policy Gradient Methods:
- The discriminator's score for the current utterance being human-generated is used as a reward for the generator, which is trained to maximize the expected reward of generated utterances using the REINFORCE algorithm.

Uses Monte Carlo policy gradient (REINFORCE), approximated by the likelihood-ratio trick:

J(\theta) = \mathbb{E}_{y \sim p(y \mid x)}\left[ Q_{+}(\{x, y\}) \mid \theta \right]

\nabla J(\theta) \approx \left[ Q_{+}(\{x, y\}) - b(\{x, y\}) \right] \nabla \sum_{t} \log p(y_t \mid x, y_{1:t-1})

Reading the annotated equation:
- Q_{+}(\{x, y\}) is the classification score assigned by the discriminator, used as a scalar reward.
- b(\{x, y\}) is a baseline value that reduces the variance of the estimate while keeping it unbiased.
- \nabla \sum_{t} \log p(y_t \mid x, y_{1:t-1}) is the gradient of the policy in parameter space: the policy is updated in the direction given by the reward.
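A minimal sketch of one REINFORCE update under these definitions, assuming PyTorch and hypothetical `generator.sample` / `discriminator.score` APIs (the paper learns the baseline; a constant stands in here):

```python
import torch

def reinforce_step(generator, discriminator, x, optimizer, baseline=0.5):
    # Sample a response from the current policy, keeping per-token log-probs.
    y, log_probs = generator.sample(x)       # log_probs: 1-D tensor, one entry per token
    with torch.no_grad():
        # Reward Q_+({x, y}): hypothetical helper returning P(human | {x, y}).
        reward = discriminator.score(x, y)
    # Likelihood-ratio estimate: (Q_+ - b) * grad log pi(y | x).
    loss = -(reward - baseline) * log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(reward)
```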
Training Methods (Cont’d)

Problems with REINFORCE:
- The expectation of the reward is approximated by only one sample.
- The reward associated with that sample is used for all actions (every generated token).

Example:
Input: What's your name?
human: I am John
machine: I don't know

REINFORCE assigns the same negative reward to all tokens [I, don't, know], because the whole response receives a single score from the discriminator. Proper credit assignment in training would give separate rewards: most likely a neutral reward for the token I, and negative rewards for don't and know.

The authors call their solution: Reward for Every Generation Step (REGS)
Reward for Every Generation Step (REGS)
We need rewards for intermediate steps.
Two Strategies Introduced:
1. Monte Carlo (MC) Search
2. Training Discriminator For Rewarding Partially Decoded Sequences
Monte Carlo Search

1. Given a partially decoded sequence s, the model keeps sampling tokens from the distribution until the decoding finishes.
2. This is repeated N times (the N generated sequences share the common prefix s).
3. The N sequences are fed to the discriminator, and the average score is used as the reward for s.
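A minimal sketch of this rollout procedure, reusing the hypothetical generator/discriminator interfaces from above (`generator.rollout` is an assumed helper that completes a prefix):

```python
def mc_reward(generator, discriminator, x, prefix, n_rollouts=5):
    """Average discriminator score over N completed rollouts sharing `prefix`."""
    scores = []
    for _ in range(n_rollouts):
        y = generator.rollout(x, prefix)           # keep sampling until decoding finishes
        scores.append(float(discriminator.score(x, y)))
    return sum(scores) / n_rollouts                # reward for the partial sequence
```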
Rewarding Partially Decoded Sequences

Directly train a discriminator that can assign rewards to both fully and partially decoded sequences.
- Break generated sequences into partial sequences.

Problem:
- Earlier actions in a sequence are shared among multiple training examples for the discriminator.
- This results in overfitting.

The authors propose a strategy similar to one used in AlphaGo to mitigate the problem.
Rewarding Partially Decoded Sequences
For each collection of subsequences of Y, randomly sample only one example from positive examples
and one example from negative examples, which are used to update discriminator.
- Time effective but less accurate than MC model.
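A minimal sketch of this subsampling step, under the assumption that responses are token lists and labels are 1 for human, 0 for machine (the function name is hypothetical):

```python
import random

def sample_partial_examples(x, y_human, y_machine):
    """One positive and one negative partial sequence per dialogue,
    so shared prefixes are not repeated across training examples."""
    t_pos = random.randint(1, len(y_human))        # random prefix length
    t_neg = random.randint(1, len(y_machine))
    positive = (x, y_human[:t_pos], 1)             # label 1: human-generated
    negative = (x, y_machine[:t_neg], 0)           # label 0: machine-generated
    return positive, negative
```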
With a discriminator trained on partial sequences, the update uses a per-step reward instead of a single sequence-level one:

\nabla J(\theta) \approx \sum_{t} \left[ Q_{+}(x, Y_{1:t}) - b(x, Y_{1:t}) \right] \nabla \log p(y_t \mid x, Y_{1:t-1})

As before, Q_{+}(x, Y_{1:t}) is the classification score, b(x, Y_{1:t}) is a baseline value that reduces the variance of the estimate while keeping it unbiased, and the last factor is the gradient of the policy in parameter space.
Teacher Forcing

The generative model is still unstable, because:
- The generator can only be indirectly exposed to the gold-standard target sequences through the reward passed back from the discriminator.
- This reward is only used to promote or discourage the generator's own generated sequences.

This is fragile, because:
- Once the generator accidentally deteriorates in some training batches, and the discriminator consequently does an extremely good job at recognizing sequences from the generator, the generator immediately gets lost.
- It knows that its generated results are bad, but does not know what results are good.
Teacher Forcing (Cont’d)

The authors propose feeding human-generated responses to the generator for model updates:
- The discriminator automatically assigns a reward of 1 to the human responses, and the generator uses this reward to update itself.
- This is analogous to having a teacher intervene and force the generator to produce the true responses.

The generator then updates itself on the human-generated example only if the reward is larger than the baseline value.
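A minimal sketch of one teacher-forcing update under these rules, with the same hypothetical generator API as before (`generator.log_probs` is assumed; the baseline is simplified to a constant):

```python
def teacher_forcing_step(generator, x, y_human, optimizer, baseline=0.5):
    reward = 1.0                                   # assigned by the discriminator to human responses
    if reward > baseline:                          # update only when above the baseline
        log_probs = generator.log_probs(x, y_human)    # log pi(y_human | x), per token
        loss = -(reward - baseline) * log_probs.sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```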
Pseudocode for the Algorithm
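A minimal sketch of the overall training loop the pseudocode describes, chaining the hypothetical helpers from the earlier sketches (`update_discriminator`, the sampling APIs, and the update schedule are all assumptions, not the paper's exact algorithm):

```python
def adversarial_training(data, generator, discriminator, g_opt, d_opt,
                         steps=10000, d_steps=5):
    for _ in range(steps):
        # Discriminator updates: human vs machine episodes.
        for _ in range(d_steps):
            x, y_human = data.sample()
            y_machine, _ = generator.sample(x)
            update_discriminator(discriminator, d_opt,
                                 [(x, y_human, 1), (x, y_machine, 0)])
        # Generator updates: policy gradient, then a teacher-forcing step.
        x, y_human = data.sample()
        reinforce_step(generator, discriminator, x, g_opt)
        teacher_forcing_step(generator, x, y_human, g_opt)
```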
Result Again
The adversarially trained system generates higher-quality responses than previous baselines!
Notes

The approach did not show great performance on the abstractive summarization task.

Maybe this is because the adversarial training strategy is more beneficial to:
- Tasks in which there is a big discrepancy between the distributions of the generated sequences and the reference target sequences.
- Tasks in which the input sequences do not bear all the information needed to generate the target; in other words, tasks where there is no single correct target sequence in the semantic space.