
  • The Amazing World of Neural Language Generation
    Part II: Neural Network Modeling for Generation

    Yangfeng Ji

    November 20, 2020

    Department of Computer Science, University of Virginia

  • Basic Architecture of Neural NLG Models

    From a (very) high-level viewpoint, a neural NLG model can be formulated with two fundamental components, an encoder and a decoder:

    Input x → Encoder → Decoder → Text y

    where

    ▶ Input: a sequence of words x = (x_1, ..., x_m)
    ▶ Output: a sequence of words y = (y_1, ..., y_n) produced by the decoder


  • Roadmap

    This part of the tutorial covers the three major neural network modeling strategies for text generation, organized under the encoder-decoder framework:

    ▶ Contextual Information: Attention Mechanism, Copying Mechanism, Memory Modules
    ▶ Latent Representation: Variational Autoencoders, Generative Adversarial Networks, Disentangled Representations, Pretrained Language Models
    ▶ Structural Information: Neural Templates, Syntactic, Semantic


  • Recurrent Neural Networks as Encoders/Decoders

    A simple implementation of the encoder-decoder framework is to realize each component with a recurrent neural network, as illustrated below:

    [diagram: encoder states s_1, s_2, s_3 read the input words x_1, x_2, x_3; decoder states h_1, h_2, h_3 read the start symbol, y_1, y_2 and emit y_1, y_2, y_3]

    where each s_i (h_t) is a hidden state of the encoder (decoder) recurrent neural network.

    Simple extensions on the encoder side:

    ▶ Bi-directional RNN
    ▶ Stacked LSTM
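    A minimal PyTorch sketch of this RNN-based encoder-decoder (with a bi-directional LSTM encoder, one of the extensions above). The module names, dimensions, and the way the decoder is initialized from the encoder are illustrative choices, not details from the tutorial.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # Encoder: bi-directional LSTM over the input words x_1..x_m
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                                   bidirectional=True)
            # Decoder: uni-directional LSTM over the previously generated words
            self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.bridge = nn.Linear(2 * hidden_dim, hidden_dim)  # encoder state -> decoder state
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, y_prev):
            # x:      (batch, m) input word ids
            # y_prev: (batch, n) previously generated word ids (teacher forcing)
            s, _ = self.encoder(self.embed(x))           # s: (batch, m, 2*hidden)
            h0 = torch.tanh(self.bridge(s[:, -1]))       # initialize the decoder
            c0 = torch.zeros_like(h0)
            h, _ = self.decoder(self.embed(y_prev),
                                (h0.unsqueeze(0), c0.unsqueeze(0)))
            return self.out(h)                           # (batch, n, vocab) logits

    model = Seq2Seq()
    logits = model(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1000, (2, 4)))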


  • Recurrent Neural Network Decoder

    In general, a decoder can be implemented as an auto-regressive model, with the hidden state computed as

    h_t = f(h_{t-1}, y_{t-1})    (1)

    [diagram: the decoder state h_{t-1} and the previous word y_{t-1} produce h_t, which emits y_t]

    For generation, the probability of the word y_t is computed as

    p(y_t | h_t) = softmax(W_o h_t)    (2)

    where W_o ∈ ℝ^{|V|×d} is a learnable weight matrix for the output layer.

    Decoding algorithms will be discussed in the next part of the tutorial.
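    A small sketch of one decoder step in Eqs. (1)-(2). Here f is realized as a GRU cell, which is only one possible choice; all names and dimensions are illustrative.

    import torch
    import torch.nn as nn

    emb_dim, hidden_dim, vocab_size = 32, 64, 1000
    embed = nn.Embedding(vocab_size, emb_dim)
    f = nn.GRUCell(emb_dim, hidden_dim)        # h_t = f(h_{t-1}, y_{t-1})
    W_o = nn.Linear(hidden_dim, vocab_size)    # output projection

    def decode_step(h_prev, y_prev):
        # h_prev: (batch, hidden_dim); y_prev: (batch,) previous word ids
        h_t = f(embed(y_prev), h_prev)                 # Eq. (1)
        p_t = torch.softmax(W_o(h_t), dim=-1)          # Eq. (2): p(y_t | h_t)
        return h_t, p_t

    h, p = decode_step(torch.zeros(2, hidden_dim), torch.tensor([1, 2]))
    # p sums to 1 over the vocabulary for each example in the batch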


  • Contextual Information

  • Roadmap

    Encoder-Decoder Framework → Contextual Information: Attention Mechanism, Copying Mechanism, Memory Modules

  • Attention Mechanism

    The attention mechanism (Bahdanau et al., 2015) provides a way of actively encoding context information from the preceding text x = (x_1, ..., x_m).
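    A sketch of one common (additive) form of this attention: the decoder state h_t is scored against every encoder state s_i, and the normalized weights form a context vector. The parameter names (W_s, W_h, v) and dimensions are illustrative.

    import torch
    import torch.nn as nn

    hidden_dim = 64
    W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)
    W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
    v = nn.Linear(hidden_dim, 1, bias=False)

    def attend(h_t, s):
        # h_t: (batch, hidden); s: (batch, m, hidden) encoder states
        scores = v(torch.tanh(W_s(s) + W_h(h_t).unsqueeze(1))).squeeze(-1)    # (batch, m)
        alpha = torch.softmax(scores, dim=-1)                  # attention weights
        context = torch.bmm(alpha.unsqueeze(1), s).squeeze(1)  # weighted sum of encoder states
        return context, alpha

    context, alpha = attend(torch.randn(2, hidden_dim), torch.randn(2, 7, hidden_dim))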


  • Attention Mechanism in FFNs

    Rush et al. (2015) use the attention mechanism in a feed-forward neural network for abstractive sentence summarization. Their attention-based encoder uses the input x and a fixed context window y_c = y_{i−c+1:i} to compute the attention weights:

    α ∝ exp(x̃ P y'_c),    x̄_i = Σ_{q=i−Q}^{i+Q} x̃_q / Q,    enc(x, y_c) = αᵀ x̄    (3)
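    A toy sketch of the attention-based encoder in Eq. (3): input word embeddings are scored against an embedded context window, and a local average smooths the input embeddings. All tensors here are random stand-ins, and the smoothing is approximated with an average pool over the window.

    import torch
    import torch.nn.functional as F

    m, d, Q = 8, 16, 2                  # input length, embedding size, half window
    x_tilde = torch.randn(m, d)         # embedded input words
    y_c = torch.randn(d)                # embedded (flattened) context window
    P = torch.randn(d, d)               # alignment matrix

    alpha = torch.softmax(x_tilde @ P @ y_c, dim=0)     # alpha ∝ exp(x̃ P y'_c)

    # x̄_i: local average of the input embeddings around position i
    x_pad = F.pad(x_tilde.t().unsqueeze(0), (Q, Q))     # (1, d, m + 2Q)
    x_bar = F.avg_pool1d(x_pad, kernel_size=2 * Q + 1, stride=1).squeeze(0).t()

    enc = alpha @ x_bar                                  # enc(x, y_c) = αᵀ x̄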


  • Copying Mechanism: CopyNet

    Gu et al. (2016) propose a model called CopyNet to directly copy phrases from the input x to the output y,

    p(y_t | h_t) = p_g(y_t | h_t) + p_c(y_t | h_t)    (4)

    where p_g(y_t | h_t) is a probability distribution defined on the vocabulary V and p_c(y_t | h_t) is a probability distribution defined only on the words in the input x.

    In other words, it defines a mixture model with equal mixture coefficients.
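    A rough sketch of the mixture in Eq. (4): scores for generating from the vocabulary and for copying each source position are normalized jointly, and the copy probabilities are projected back onto the vocabulary by the copied words' ids. This simplifies CopyNet considerably; the scoring functions are random stand-ins.

    import torch

    vocab_size = 10
    x_ids = torch.tensor([3, 5, 5, 7])          # word ids of the input x
    gen_scores = torch.randn(vocab_size)        # generate-mode scores over V
    copy_scores = torch.randn(len(x_ids))       # copy-mode scores over positions in x

    probs = torch.softmax(torch.cat([gen_scores, copy_scores]), dim=0)  # shared normalization
    p_g, p_c_pos = probs[:vocab_size], probs[vocab_size:]
    p_c = torch.zeros(vocab_size).scatter_add(0, x_ids, p_c_pos)        # copy mass per word

    p = p_g + p_c        # Eq. (4); sums to 1 over the vocabulary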

  • Copying Mechanism: Pointer-Generator Networks

    See et al. (2017) propose a similar idea to copy words from the input x to the output y in text summarization. The probability of w ∈ V being the next word is

    p(y_t = w | ·) = g · p_g(y_t = w | ·) + (1 − g) Σ_{x_i ∈ x} 1(w = x_i) α_{t,i}    (5)

    Among the many implementation differences with (Gu et al., 2016), this work uses

    ▶ a soft gate g ∈ (0, 1) to decide the probability of generating instead of copying
    ▶ a copy probability for a word w that is defined by the attention weights associated with it
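    A sketch of Eq. (5): the soft gate g mixes the generation distribution with a copy distribution obtained by summing the attention weights of every source position that carries the word w. The tensors are random stand-ins.

    import torch

    vocab_size = 10
    x_ids = torch.tensor([3, 5, 5, 7])                    # source word ids
    alpha_t = torch.softmax(torch.randn(len(x_ids)), 0)   # attention weights at step t
    p_gen = torch.softmax(torch.randn(vocab_size), 0)     # generation distribution
    g = torch.sigmoid(torch.randn(()))                    # soft gate in (0, 1)

    # Copy distribution: total attention mass of the source positions holding each word
    p_copy = torch.zeros(vocab_size).scatter_add(0, x_ids, alpha_t)

    p = g * p_gen + (1 - g) * p_copy                      # Eq. (5)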


  • Memory Modules

    One example is proposed in (Clark et al., 2018) for entity-driven text (story) generation, where each memory cell is associated with a particular entity, to encode entity-related information from the context.

    [diagram: decoder states h_{t-1} and h_t attend over a memory of entity representations e_1, e_2, ...; here the entity EMNLP is selected]

    where each e is a distributed representation of an entity.

    ▶ Entity prediction: p(e = EMNLP) ∝ exp(hᵀ_{t-1} W_e e_EMNLP + w_f f(e)), where f is the surface feature related to this entity (Ji et al., 2017).
    ▶ Dynamic entity updating: e_EMNLP^(new) ∝ δ_t e_EMNLP^(old) + (1 − δ_t) h_t, where δ_t ∈ (0, 1) determines how much information should be kept from the previous representation (both operations are sketched below).
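    A sketch of the two operations above: entity prediction with a bilinear score against h_{t-1}, and a gated update of the selected entity's representation with h_t. Names and dimensions are illustrative, the surface features f(e) are omitted, and the unit-length renormalization is only one possible reading of the proportionality sign.

    import torch
    import torch.nn as nn

    d = 64
    W_e = nn.Parameter(torch.randn(d, d))      # bilinear weight for entity scores
    memory = torch.randn(5, d)                 # one row per entity: e_1, ..., e_5

    def predict_entity(h_prev):
        scores = memory @ W_e @ h_prev         # h^T_{t-1} W_e e for every entity
        return torch.softmax(scores, dim=0)    # p(e = ...) over the entities

    def update_entity(memory, idx, h_t, delta):
        e_new = delta * memory[idx] + (1 - delta) * h_t   # convex combination
        memory = memory.clone()
        memory[idx] = e_new / e_new.norm()                # renormalize the updated entity
        return memory

    p_e = predict_entity(torch.randn(d))
    memory = update_entity(memory, idx=int(p_e.argmax()), h_t=torch.randn(d), delta=0.7)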


  • Latent Representation

  • Roadmap

    Encoder-Decoder Framework → Latent Representation: Variational Autoencoders, Generative Adversarial Networks, Disentangled Representations, Pretrained Language Models

  • Variational Autoencoders in NLG

    A one-page summary of the variational autoencoder (Kingma and Welling, 2014):

    [diagram: inference maps x to the latent variable z; generation maps z back to x]

    A formulation of variational autoencoders for text generation:

    h = Encoder(x)    (6)
    z = h + ε,  ε ∼ N(0, diag(σ²))    (7)
    x̃ = Decoder(z)    (8)

    An impact of using a VAE is that it (1) produces a robust encoder for the input x and (2) enriches the hidden space.
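    A sketch of Eqs. (6)-(8) with the reparameterization trick: the encoder output h acts as the mean, a predicted log-variance gives σ, and z = h + σ ⊙ ε is passed to the decoder. The linear layers stand in for the actual text encoder/decoder, and all sizes are illustrative.

    import torch
    import torch.nn as nn

    d = 32
    encoder = nn.Linear(100, d)       # stand-in for an RNN/Transformer text encoder
    decoder = nn.Linear(d, 100)       # stand-in for the text decoder
    to_logvar = nn.Linear(d, d)       # predicts log σ² from h

    x = torch.randn(4, 100)
    h = encoder(x)                                   # Eq. (6)
    sigma = torch.exp(0.5 * to_logvar(h))
    z = h + sigma * torch.randn_like(h)              # Eq. (7): z = h + ε, ε ~ N(0, diag(σ²))
    x_tilde = decoder(z)                             # Eq. (8)
    # Training adds a KL term that keeps q(z | x) close to the prior N(0, I)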


  • Variational Seq2seq Models

    An example application of the variational autoencoder in language generation (Bowman et al., 2016):

    ▶ The mean and variance of the latent variable z are computed by linear transformations of the last hidden state s_m of the RNN encoder
    ▶ Other influential ideas from this work are KL cost annealing (a toy schedule is sketched below) and adversarial evaluation
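    A toy sketch of KL cost annealing: the weight on the KL term is ramped up from 0 to 1 during training so the decoder cannot simply ignore z early on. The sigmoid schedule and its constants are illustrative.

    import math

    def kl_weight(step, midpoint=2500, steepness=0.003):
        return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))

    # loss = reconstruction_loss + kl_weight(step) * kl_divergence
    print([round(kl_weight(s), 3) for s in (0, 1000, 2500, 5000)])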


  • Conditional VAE

    A simple formulation of the conditional VAE is proposed by Sohn et al. (2015), initially for computer vision.

    [diagram: x and the latent variable z jointly generate y]

    In text generation, consider x, y, and z to be the input text, the output text, and the latent representation of the input text (a small sketch of the components follows below):

    ▶ Generation network: p_θ(y | x, z), where z ∼ p_θ(z | x)
    ▶ Inference network: q_φ(z | x, y)
    ▶ Example application: in response generation, the latent variable z indicates whether the next utterance is a generic response (Shen et al., 2017)
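    A sketch of the three conditional VAE components listed above, with small linear layers standing in for the actual networks; names, sizes, and the way x and y are represented are illustrative.

    import torch
    import torch.nn as nn

    d = 32
    prior_net = nn.Linear(d, 2 * d)        # p_theta(z | x): mean and log-variance
    recog_net = nn.Linear(2 * d, 2 * d)    # q_phi(z | x, y)
    generator = nn.Linear(2 * d, d)        # p_theta(y | x, z)

    x_rep, y_rep = torch.randn(4, d), torch.randn(4, d)

    mu_q, logvar_q = recog_net(torch.cat([x_rep, y_rep], dim=-1)).chunk(2, dim=-1)
    z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)   # sample from q
    y_out = generator(torch.cat([x_rep, z], dim=-1))

    # Training minimizes reconstruction loss plus KL(q_phi(z | x, y) || p_theta(z | x))
    mu_p, logvar_p = prior_net(x_rep).chunk(2, dim=-1)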

  • Generative Adversarial Networks

    The basic GAN pipeline is as follows (a toy sketch of the objectives appears below):

    [diagram: a latent variable z is fed to the generator G, which produces x′; the discriminator D receives the real input x and the generated x′ and makes a prediction]

    ▶ One goal is to learn a generator G that can generate text x′ with the same quality as x (in other words, to “fool” the discriminator)
    ▶ A potential application is to adopt the framework as one component in other task-specific generation tasks (e.g., style transfer)
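    A toy sketch of the adversarial objective: D is trained to separate real representations x from generated ones x′, and G is trained to fool D. For discrete text, the generator is usually updated with policy gradients (as in Li et al., 2017) rather than by direct backpropagation as below; modules and sizes are illustrative stand-ins.

    import torch
    import torch.nn as nn

    d = 32
    G = nn.Linear(d, d)                    # stand-in generator: z -> x'
    D = nn.Linear(d, 1)                    # stand-in discriminator
    bce = nn.BCEWithLogitsLoss()

    x_real = torch.randn(8, d)
    x_fake = G(torch.randn(8, d))

    # Discriminator step: push real examples toward 1 and generated ones toward 0
    d_loss = bce(D(x_real), torch.ones(8, 1)) + bce(D(x_fake.detach()), torch.zeros(8, 1))

    # Generator step: make the discriminator label the generated samples as real
    g_loss = bce(D(x_fake), torch.ones(8, 1))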

  • GANs for Generation

    A straightforward application of adversarial learning is to replace the generator with a sequence-to-sequence model, as proposed by Li et al. (2017):

    [diagram: an encoder-decoder maps the context x to a generated response y′; the discriminator D receives a human response y or the generated y′ and makes a prediction]

    ▶ The decoder generates a response y′ based on the input context x
    ▶ The discriminator predicts whether a response is generated by humans or by the decoder

  • Learning Disentangled Representations

    One way to utilize a discriminator is to learn disentangled representations and make sure the latent representations encode the expected attributes for generation.

    An example of learning disentangled representations is proposed by Hu et al. (2017), with

    ▶ Encoder: z = Encoder(x)
    ▶ Decoder: x̂ = Decoder(z, c)
    ▶ Discriminator: c = Dis(x̂)

    where c encodes the attributes of a text (e.g., sentiment categories, formality); a small sketch of these components follows below. Other work on learning disentangled representations includes (Bao et al., 2019; Cheng et al., 2020).
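    A sketch of the three components above: the encoder produces z from x, the decoder generates x̂ from (z, c), and the discriminator tries to recover the attribute c from x̂ so that c actually controls generation. All modules are small stand-ins with illustrative names and sizes.

    import torch
    import torch.nn as nn

    d, n_attr = 32, 2
    encoder = nn.Linear(100, d)
    decoder = nn.Linear(d + n_attr, 100)
    discriminator = nn.Linear(100, n_attr)

    x = torch.randn(4, 100)
    c = torch.eye(n_attr)[torch.randint(0, n_attr, (4,))]     # one-hot attribute codes

    z = encoder(x)                            # z = Encoder(x)
    x_hat = decoder(torch.cat([z, c], -1))    # x̂ = Decoder(z, c)
    c_pred = discriminator(x_hat)             # c = Dis(x̂)

    # The attribute loss encourages x̂ to carry the attribute c
    attr_loss = nn.functional.cross_entropy(c_pred, c.argmax(-1))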


  • Transformers

    Vaswani et al. (2017): “Attention mechanism . . . , allowing modeling of dependencies without regard to their distance in the input or output sequences.”

    Recent applications of transformers simply use them as basic building blocks, just as LSTMs are used.

  • Generative Pre-trained Transformers (GPT)

    GPT (Radford et al., 2018) is trained simply by predicting the next word with

    h_0 = W_e x_{t−k:t−1} + W_p
    h_l = transformer_block(h_{l−1})  ∀ l ∈ [1, 12]
    p(y_t | h_n) = softmax(W_o h_n)    (9)

    where W_e and W_p are the word and position embeddings.
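    A sketch of Eq. (9): token plus position embeddings pass through a stack of transformer blocks with a causal mask, and the last hidden state predicts the next word. PyTorch's TransformerEncoderLayer stands in for GPT's transformer block; layer count, sizes, and names are illustrative.

    import torch
    import torch.nn as nn

    vocab_size, d, n_layers, k = 1000, 64, 12, 16
    W_e = nn.Embedding(vocab_size, d)              # word embeddings
    W_p = nn.Embedding(k, d)                       # position embeddings
    blocks = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        for _ in range(n_layers)
    ])
    W_o = nn.Linear(d, vocab_size)

    x = torch.randint(0, vocab_size, (2, k))       # the context words y_{t-k:t-1}
    mask = nn.Transformer.generate_square_subsequent_mask(k)   # causal mask

    h = W_e(x) + W_p(torch.arange(k))              # h_0 = W_e x + W_p
    for block in blocks:
        h = block(h, src_mask=mask)                # h_l = transformer_block(h_{l-1})
    p_next = torch.softmax(W_o(h[:, -1]), dim=-1)  # p(y_t | h_n) = softmax(W_o h_n)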

  • Pre-trained Models as Teachers

    Besides generative models like GPT-2 and its variants, we can also use BERT for text generation. One example from (Chen et al., 2020b) is that BERT can be used as a teacher model to help train a sequence-to-sequence model for better output probabilities.

  • Structural Information

  • Roadmap

    Encoder-Decoder Framework → Structural Information: Neural Templates, Syntactic, Semantic

  • Generation with Neural Templates

    An extension of variational autoencoders is to incorporate sequential information in the latent variables. For example, Wiseman et al. (2018) propose a semi-Markov model on the latent variable sequence z = (z_1, ..., z_n) to capture dependencies among adjacent words for data-to-text generation.

    A neural hidden semi-Markov model decoder:

    ▶ Transition distribution p(z_{t+1} | z_t, x)
    ▶ Length distribution p(l_{t+1} | z_{t+1})
    ▶ Emission distribution p(y_{t−l_t+1:t} | z_t = k, l_t = l, x)

  • Graph-to-Sequence Generation

    An example input is an AMR graph, where the task is to generate a text with the same meaning as the input graph.

    ▶ The whole model is a layer-wise LSTM: within each layer the LSTM runs over the linearized graph, and between layers the hidden states are updated with neighboring nodes from the previous layer
    ▶ The idea of linearizing an AMR graph (using depth-first search) is proposed by Konstas et al. (2017); a toy sketch of such a linearization follows below
    ▶ Another idea for incorporating AMR graphs: graph neural networks (Chen et al., 2020a)
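    A toy sketch of depth-first linearization of a small graph, in the spirit of the AMR linearization used by Konstas et al. (2017); the graph, node names, and bracketing format are illustrative, not actual AMR notation.

    def linearize(graph, node, visited=None):
        visited = visited if visited is not None else set()
        visited.add(node)
        tokens = [node]
        for rel, child in graph.get(node, []):
            tokens.append(rel)
            if child in visited:
                tokens.append(child)          # re-entrant node: emit its name only
            else:
                tokens += ["("] + linearize(graph, child, visited) + [")"]
        return tokens

    # "want-01" has an ARG0 "boy" and an ARG1 "go-02" whose ARG0 is the same "boy"
    graph = {"want-01": [(":ARG0", "boy"), (":ARG1", "go-02")],
             "go-02": [(":ARG0", "boy")]}
    print(" ".join(linearize(graph, "want-01")))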


  • Dialogue Generation with Semantic Exemplars

    A straightforward application of GPT is the response generation approach proposed by Gupta et al. (2020). The prediction is still done word by word, where the input sequence and output sequence are¹

    x = (Dialogue context, Response frames, Response text)
    y = (〈MASK〉, Response frames, Response text)    (10)

    ¹ An example of a semantic frame, Perception: hear, say, see, smell, feel.
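    A toy sketch of how the sequences in Eq. (10) could be laid out for a GPT-style model: the dialogue-context segment is masked out in the target so the loss is only computed on the frames and the response. The separator tokens and this helper are illustrative, not the paper's exact format.

    MASK = "<MASK>"

    def build_example(context, frames, response):
        x = context + ["<frames>"] + frames + ["<response>"] + response
        y = [MASK] * len(context) + ["<frames>"] + frames + ["<response>"] + response
        return x, y

    x, y = build_example(
        context=["did", "you", "hear", "that", "?"],
        frames=["Perception"],
        response=["yes", ",", "I", "heard", "it", "too", "."],
    )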

  • Summary

    Encoder-Decoder Framework

    ▶ Contextual Information: Attention Mechanism, Copying Mechanism, Memory Modules
    ▶ Latent Representation: Variational Autoencoders, Generative Adversarial Networks, Disentangled Representations, Pretrained Language Models
    ▶ Structural Information: Neural Templates, Syntactic, Semantic


  • Reference I

    Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations. arXiv: 1409.0473.

    Bao, Y., Zhou, H., Huang, S., Li, L., Mou, L., Vechtomova, O., Dai, X.-y., and Chen, J. (2019). Generating Sentences from Disentangled Syntactic and Semantic Spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6008–6019, Florence, Italy. Association for Computational Linguistics.

    Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.

    Chen, M., Tang, Q., Wiseman, S., and Gimpel, K. (2019). Controllable Paraphrase Generation with a Syntactic Exemplar. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5972–5984, Florence, Italy. Association for Computational Linguistics.

    Chen, Y., Wu, L., and Zaki, M. J. (2020a). Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation. arXiv:1908.04942 [cs]. arXiv: 1908.04942.

    Chen, Y.-C., Gan, Z., Cheng, Y., Liu, J., and Liu, J. (2020b). Distilling Knowledge Learned in BERT for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7893–7905, Online. Association for Computational Linguistics.

    Cheng, P., Min, M. R., Shen, D., Malon, C., Zhang, Y., Li, Y., and Carin, L. (2020). Improving Disentangled Text Representation Learning with Information-Theoretic Guidance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7530–7541, Online. Association for Computational Linguistics.

    Clark, E., Ji, Y., and Smith, N. A. (2018). Neural Text Generation in Stories Using Entity Representations as Context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250–2260, New Orleans, Louisiana. Association for Computational Linguistics.

    Gu, J., Lu, Z., Li, H., and Li, V. O. (2016). Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

    Gupta, P., Bigham, J. P., Tsvetkov, Y., and Pavel, A. (2020). Controlling Dialogue Generation with Semantic Exemplars. arXiv:2008.09075 [cs]. arXiv: 2008.09075.

    Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E. P. (2017). Toward Controlled Generation of Text. Volume 70 of Proceedings of Machine Learning Research, pages 1587–1596, International Convention Centre, Sydney, Australia. PMLR.

  • Reference II

    Ji, Y., Tan, C., Martschat, S., Choi, Y., and Smith, N. A. (2017). Dynamic Entity Representations in Neural Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1830–1839, Copenhagen, Denmark. Association for Computational Linguistics.

    Kingma, D. P. and Welling, M. (2014). Auto-Encoding Variational Bayes. In ICLR. arXiv: 1312.6114.

    Konstas, I., Iyer, S., Yatskar, M., Choi, Y., and Zettlemoyer, L. (2017). Neural AMR: Sequence-to-Sequence Models for Parsing and Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 146–157, Vancouver, Canada. Association for Computational Linguistics.

    Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., and Jurafsky, D. (2017). Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.

    Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.

    Rush, A. M., Chopra, S., and Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.

    See, A., Liu, P. J., and Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

    Shen, X., Su, H., Li, Y., Li, W., Niu, S., Zhao, Y., Aizawa, A., and Long, G. (2017). A Conditional Variational Framework for Dialog Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 504–509, Vancouver, Canada. Association for Computational Linguistics.

    Sohn, K., Lee, H., and Yan, X. (2015). Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems, pages 3483–3491.

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is All you Need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

    Wiseman, S., Shieber, S., and Rush, A. (2018). Learning Neural Templates for Text Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3174–3187, Brussels, Belgium. Association for Computational Linguistics.
