
  • The Amazing World of Neural Language Generation
    Part II: Neural Network Modeling for Generation

    Yangfeng Ji

    November 20, 2020

    Department of Computer Science, University of Virginia

  • Basic Architecture of Neural NLG Models

    From a (very) high-level viewpoint, a neural NLG model can be formulated with two fundamental components, an encoder and a decoder:

    Input x → Encoder → Decoder → Text y

    where

    ▶ Input: a sequence of words x = (x_1, ..., x_m)
    ▶ Output: a sequence of words y = (y_1, ..., y_n) produced by the decoder


  • Roadmap

    This part of the tutorial covers the three major neural network modeling strategies for text generation, organized under the encoder-decoder framework:

    ▶ Contextual Information: Attention Mechanism, Copying Mechanism, Memory Modules
    ▶ Latent Representation: Variational Autoencoders, Generative Adversarial Networks, Disentangled Representations, Pretrained Language Models
    ▶ Structural Information: Neural Templates, Syntactic, Semantic


  • Recurrent Neural Networks as Encoders/Decoders

    A simple implementation of the encoder-decoder framework is to realize each component with a recurrent neural network, as illustrated below:

    [diagram: encoder states s_1, s_2, s_3 read the input words x_1, x_2, x_3; decoder states h_1, h_2, h_3 read the start symbol, y_1, y_2 and emit y_1, y_2, y_3]

    where each s_i (h_t) is a hidden state of the encoder (decoder) recurrent neural network.

    Simple extensions on the encoder side:

    ▶ Bi-directional RNN
    ▶ Stacked LSTM
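    A minimal PyTorch sketch of this RNN-based encoder-decoder (with a bi-directional LSTM encoder, one of the extensions above). The module names, dimensions, and the way the decoder is initialized from the encoder are illustrative choices, not details from the tutorial.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # Encoder: bi-directional LSTM over the input words x_1..x_m
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                                   bidirectional=True)
            # Decoder: uni-directional LSTM over the previously generated words
            self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.bridge = nn.Linear(2 * hidden_dim, hidden_dim)  # encoder state -> decoder state
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, y_prev):
            # x:      (batch, m) input word ids
            # y_prev: (batch, n) previously generated word ids (teacher forcing)
            s, _ = self.encoder(self.embed(x))           # s: (batch, m, 2*hidden)
            h0 = torch.tanh(self.bridge(s[:, -1]))       # initialize the decoder
            c0 = torch.zeros_like(h0)
            h, _ = self.decoder(self.embed(y_prev),
                                (h0.unsqueeze(0), c0.unsqueeze(0)))
            return self.out(h)                           # (batch, n, vocab) logits

    model = Seq2Seq()
    logits = model(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1000, (2, 4)))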


  • Recurrent Neural Network Decoder

    In general, a decoder can be implemented as an auto-regressive model, with the hidden state computed as

    h_t = f(h_{t-1}, y_{t-1})    (1)

    [diagram: the decoder state h_{t-1} and the previous word y_{t-1} produce h_t, which emits y_t]

    For generation, the probability of the word y_t is computed as

    p(y_t | h_t) = softmax(W_o h_t)    (2)

    where W_o ∈ ℝ^{|V|×d} is a learnable weight matrix for the output layer.

    Decoding algorithms will be discussed in the next part of the tutorial.
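    A small sketch of one decoder step in Eqs. (1)-(2). Here f is realized as a GRU cell, which is only one possible choice; all names and dimensions are illustrative.

    import torch
    import torch.nn as nn

    emb_dim, hidden_dim, vocab_size = 32, 64, 1000
    embed = nn.Embedding(vocab_size, emb_dim)
    f = nn.GRUCell(emb_dim, hidden_dim)        # h_t = f(h_{t-1}, y_{t-1})
    W_o = nn.Linear(hidden_dim, vocab_size)    # output projection

    def decode_step(h_prev, y_prev):
        # h_prev: (batch, hidden_dim); y_prev: (batch,) previous word ids
        h_t = f(embed(y_prev), h_prev)                 # Eq. (1)
        p_t = torch.softmax(W_o(h_t), dim=-1)          # Eq. (2): p(y_t | h_t)
        return h_t, p_t

    h, p = decode_step(torch.zeros(2, hidden_dim), torch.tensor([1, 2]))
    # p sums to 1 over the vocabulary for each example in the batch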


  • Contextual Information

  • Roadmap

    Encoder-Decoder Framework → Contextual Information: Attention Mechanism, Copying Mechanism, Memory Modules

  • Attention Mechanism

    The attention mechanism (Bahdanau et al., 2015) provides a way of actively encoding context information from the preceding text x = (x_1, ..., x_m).
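    A sketch of one common (additive) form of this attention: the decoder state h_t is scored against every encoder state s_i, and the normalized weights form a context vector. The parameter names (W_s, W_h, v) and dimensions are illustrative.

    import torch
    import torch.nn as nn

    hidden_dim = 64
    W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)
    W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
    v = nn.Linear(hidden_dim, 1, bias=False)

    def attend(h_t, s):
        # h_t: (batch, hidden); s: (batch, m, hidden) encoder states
        scores = v(torch.tanh(W_s(s) + W_h(h_t).unsqueeze(1))).squeeze(-1)    # (batch, m)
        alpha = torch.softmax(scores, dim=-1)                  # attention weights
        context = torch.bmm(alpha.unsqueeze(1), s).squeeze(1)  # weighted sum of encoder states
        return context, alpha

    context, alpha = attend(torch.randn(2, hidden_dim), torch.randn(2, 7, hidden_dim))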


  • Attention Mechanism in FFNs

    Rush et al. (2015) use the attention mechanism in a feed-forward neural network for abstractive sentence summarization. Their attention-based encoder uses the input x and a fixed context window y_c = y_{i−c+1:i} to compute the attention weights:

    α ∝ exp(x̃ P y'_c),    x̄_i = Σ_{q=i−Q}^{i+Q} x̃_q / Q,    enc(x, y_c) = αᵀ x̄    (3)
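    A toy sketch of the attention-based encoder in Eq. (3): input word embeddings are scored against an embedded context window, and a local average smooths the input embeddings. All tensors here are random stand-ins, and the smoothing is approximated with an average pool over the window.

    import torch
    import torch.nn.functional as F

    m, d, Q = 8, 16, 2                  # input length, embedding size, half window
    x_tilde = torch.randn(m, d)         # embedded input words
    y_c = torch.randn(d)                # embedded (flattened) context window
    P = torch.randn(d, d)               # alignment matrix

    alpha = torch.softmax(x_tilde @ P @ y_c, dim=0)     # alpha ∝ exp(x̃ P y'_c)

    # x̄_i: local average of the input embeddings around position i
    x_pad = F.pad(x_tilde.t().unsqueeze(0), (Q, Q))     # (1, d, m + 2Q)
    x_bar = F.avg_pool1d(x_pad, kernel_size=2 * Q + 1, stride=1).squeeze(0).t()

    enc = alpha @ x_bar                                  # enc(x, y_c) = αᵀ x̄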


  • Copying Mechanism: CopyNet

    Gu et al. (2016) propose a model called CopyNet to directly copy phrases from the input x to the output y,

    p(y_t | h_t) = p_g(y_t | h_t) + p_c(y_t | h_t)    (4)

    where p_g(y_t | h_t) is a probability distribution defined on the vocabulary V and p_c(y_t | h_t) is a probability distribution defined only on the words in the input x.

    In other words, it defines a mixture model with equal mixture coefficients.
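    A rough sketch of the mixture in Eq. (4): scores for generating from the vocabulary and for copying each source position are normalized jointly, and the copy probabilities are projected back onto the vocabulary by the copied words' ids. This simplifies CopyNet considerably; the scoring functions are random stand-ins.

    import torch

    vocab_size = 10
    x_ids = torch.tensor([3, 5, 5, 7])          # word ids of the input x
    gen_scores = torch.randn(vocab_size)        # generate-mode scores over V
    copy_scores = torch.randn(len(x_ids))       # copy-mode scores over positions in x

    probs = torch.softmax(torch.cat([gen_scores, copy_scores]), dim=0)  # shared normalization
    p_g, p_c_pos = probs[:vocab_size], probs[vocab_size:]
    p_c = torch.zeros(vocab_size).scatter_add(0, x_ids, p_c_pos)        # copy mass per word

    p = p_g + p_c        # Eq. (4); sums to 1 over the vocabulary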

  • Copying Mechanism: Pointer-Generator Networks

    See et al. (2017) propose a similar idea to copy words from the input x to the output y in text summarization. The probability of w ∈ V being the next word is

    p(y_t = w | ·) = g · p_g(y_t = w | ·) + (1 − g) Σ_{x_i ∈ x} 1(w = x_i) α_{t,i}    (5)

    Among the many implementation differences with (Gu et al., 2016), this work uses

    ▶ a soft gate g ∈ (0, 1) to decide the probability of generating instead of copying
    ▶ a copy probability for a word w that is defined by the attention weights associated with it
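    A sketch of Eq. (5): the soft gate g mixes the generation distribution with a copy distribution obtained by summing the attention weights of every source position that carries the word w. The tensors are random stand-ins.

    import torch

    vocab_size = 10
    x_ids = torch.tensor([3, 5, 5, 7])                    # source word ids
    alpha_t = torch.softmax(torch.randn(len(x_ids)), 0)   # attention weights at step t
    p_gen = torch.softmax(torch.randn(vocab_size), 0)     # generation distribution
    g = torch.sigmoid(torch.randn(()))                    # soft gate in (0, 1)

    # Copy distribution: total attention mass of the source positions holding each word
    p_copy = torch.zeros(vocab_size).scatter_add(0, x_ids, alpha_t)

    p = g * p_gen + (1 - g) * p_copy                      # Eq. (5)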


  • Memory Modules

    One example is proposed in (Clark et al., 2018) for entity-driven text (story) generation, where each memory cell is associated with a particular entity, to encode entity-related information from the context.

    [diagram: decoder states h_{t-1} and h_t attend over a memory of entity representations e_1, e_2, ...; here the entity EMNLP is selected]

    where each e is a distributed representation of an entity.

    ▶ Entity prediction: p(e = EMNLP) ∝ exp(hᵀ_{t-1} W_e e_EMNLP + w_f f(e)), where f is the surface feature related to this entity (Ji et al., 2017).
    ▶ Dynamic entity updating: e_EMNLP^(new) ∝ δ_t e_EMNLP^(old) + (1 − δ_t) h_t, where δ_t ∈ (0, 1) determines how much information should be kept from the previous representation (both operations are sketched below).
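    A sketch of the two operations above: entity prediction with a bilinear score against h_{t-1}, and a gated update of the selected entity's representation with h_t. Names and dimensions are illustrative, the surface features f(e) are omitted, and the unit-length renormalization is only one possible reading of the proportionality sign.

    import torch
    import torch.nn as nn

    d = 64
    W_e = nn.Parameter(torch.randn(d, d))      # bilinear weight for entity scores
    memory = torch.randn(5, d)                 # one row per entity: e_1, ..., e_5

    def predict_entity(h_prev):
        scores = memory @ W_e @ h_prev         # h^T_{t-1} W_e e for every entity
        return torch.softmax(scores, dim=0)    # p(e = ...) over the entities

    def update_entity(memory, idx, h_t, delta):
        e_new = delta * memory[idx] + (1 - delta) * h_t   # convex combination
        memory = memory.clone()
        memory[idx] = e_new / e_new.norm()                # renormalize the updated entity
        return memory

    p_e = predict_entity(torch.randn(d))
    memory = update_entity(memory, idx=int(p_e.argmax()), h_t=torch.randn(d), delta=0.7)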


  • Latent Representation

  • Roadmap

    Encoder-Decoder Framework → Latent Representation: Variational Autoencoders, Generative Adversarial Networks, Disentangled Representations, Pretrained Language Models

  • Variational Autoencoders in NLG

    A one-page summary of the variational autoencoder (Kingma and Welling, 2014):

    [diagram: inference maps x to the latent variable z; generation maps z back to x]

    A formulation of variational autoencoders for text generation:

    h = Encoder(x)    (6)
    z = h + ε,  ε ∼ N(0, diag(σ²))    (7)
    x̃ = Decoder(z)    (8)

    An impact of using a VAE is that it (1) produces a robust encoder for the input x and (2) enriches the hidden space.
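    A sketch of Eqs. (6)-(8) with the reparameterization trick: the encoder output h acts as the mean, a predicted log-variance gives σ, and z = h + σ ⊙ ε is passed to the decoder. The linear layers stand in for the actual text encoder/decoder, and all sizes are illustrative.

    import torch
    import torch.nn as nn

    d = 32
    encoder = nn.Linear(100, d)       # stand-in for an RNN/Transformer text encoder
    decoder = nn.Linear(d, 100)       # stand-in for the text decoder
    to_logvar = nn.Linear(d, d)       # predicts log σ² from h

    x = torch.randn(4, 100)
    h = encoder(x)                                   # Eq. (6)
    sigma = torch.exp(0.5 * to_logvar(h))
    z = h + sigma * torch.randn_like(h)              # Eq. (7): z = h + ε, ε ~ N(0, diag(σ²))
    x_tilde = decoder(z)                             # Eq. (8)
    # Training adds a KL term that keeps q(z | x) close to the prior N(0, I)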


  • Variational Seq2seq Models

    An example application of the variational autoencoder in language generation (Bowman et al., 2016):

    ▶ The mean and variance of the latent variable z are computed by linear transformations of the last hidden state s_m of the RNN encoder
    ▶ Other influential ideas from this work are KL cost annealing (a toy schedule is sketched below) and adversarial evaluation
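    A toy sketch of KL cost annealing: the weight on the KL term is ramped up from 0 to 1 during training so the decoder cannot simply ignore z early on. The sigmoid schedule and its constants are illustrative.

    import math

    def kl_weight(step, midpoint=2500, steepness=0.003):
        return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))

    # loss = reconstruction_loss + kl_weight(step) * kl_divergence
    print([round(kl_weight(s), 3) for s in (0, 1000, 2500, 5000)])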


  • Conditional VAE

    A simple formulation of the conditional VAE is proposed by Sohn et al. (2015), initially for computer vision.

    [diagram: x and the latent variable z jointly generate y]

    In text generation, consider x, y, and z to be the input text, the output text, and the latent representation of the input text (a small sketch of the components follows below):

    ▶ Generation network: p_θ(y | x, z), where z ∼ p_θ(z | x)
    ▶ Inference network: q_φ(z | x, y)
    ▶ Example application: in response generation, the latent variable z indicates whether the next utterance is a generic response (Shen et al., 2017)
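    A sketch of the three conditional VAE components listed above, with small linear layers standing in for the actual networks; names, sizes, and the way x and y are represented are illustrative.

    import torch
    import torch.nn as nn

    d = 32
    prior_net = nn.Linear(d, 2 * d)        # p_theta(z | x): mean and log-variance
    recog_net = nn.Linear(2 * d, 2 * d)    # q_phi(z | x, y)
    generator = nn.Linear(2 * d, d)        # p_theta(y | x, z)

    x_rep, y_rep = torch.randn(4, d), torch.randn(4, d)

    mu_q, logvar_q = recog_net(torch.cat([x_rep, y_rep], dim=-1)).chunk(2, dim=-1)
    z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)   # sample from q
    y_out = generator(torch.cat([x_rep, z], dim=-1))

    # Training minimizes reconstruction loss plus KL(q_phi(z | x, y) || p_theta(z | x))
    mu_p, logvar_p = prior_net(x_rep).chunk(2, dim=-1)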

  • Generative Adversarial Networks

    The basic GAN pipeline is as follows (a toy sketch of the objectives appears below):

    [diagram: a latent variable z is fed to the generator G, which produces x′; the discriminator D receives the real input x and the generated x′ and makes a prediction]

    ▶ One goal is to learn a generator G that can generate text x′ with the same quality as x (in other words, to “fool” the discriminator)
    ▶ A potential application is to adopt the framework as one component in other task-specific generation tasks (e.g., style transfer)
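    A toy sketch of the adversarial objective: D is trained to separate real representations x from generated ones x′, and G is trained to fool D. For discrete text, the generator is usually updated with policy gradients (as in Li et al., 2017) rather than by direct backpropagation as below; modules and sizes are illustrative stand-ins.

    import torch
    import torch.nn as nn

    d = 32
    G = nn.Linear(d, d)                    # stand-in generator: z -> x'
    D = nn.Linear(d, 1)                    # stand-in discriminator
    bce = nn.BCEWithLogitsLoss()

    x_real = torch.randn(8, d)
    x_fake = G(torch.randn(8, d))

    # Discriminator step: push real examples toward 1 and generated ones toward 0
    d_loss = bce(D(x_real), torch.ones(8, 1)) + bce(D(x_fake.detach()), torch.zeros(8, 1))

    # Generator step: make the discriminator label the generated samples as real
    g_loss = bce(D(x_fake), torch.ones(8, 1))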

  • GANs for Generation

    A straightforward application of adversarial learning is to replace the generator with a sequence-to-sequence model, as proposed by Li et al. (2017):

    [diagram: an encoder-decoder maps the context x to a generated response y′; the discriminator D receives a human response y or the generated y′ and makes a prediction]

    ▶ The decoder generates a response y′ based on the input context x
    ▶ The discriminator predicts whether a response is generated by humans or by the decoder

  • Learning Disentangled Representations

    One way to utilize a discriminator is to learn disentangled representations and make sure the latent representations encode the expected attributes for generation.

    An example of learning disentangled representations is proposed by Hu et al. (2017), with

    ▶ Encoder: z = Encoder(x)
    ▶ Decoder: x̂ = Decoder(z, c)
    ▶ Discriminator: c = Dis(x̂)

    where c encodes the attributes of a text (e.g., sentiment categories, formality); a small sketch of these components follows below. Other work on learning disentangled representations includes (Bao et al., 2019; Cheng et al., 2020).
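    A sketch of the three components above: the encoder produces z from x, the decoder generates x̂ from (z, c), and the discriminator tries to recover the attribute c from x̂ so that c actually controls generation. All modules are small stand-ins with illustrative names and sizes.

    import torch
    import torch.nn as nn

    d, n_attr = 32, 2
    encoder = nn.Linear(100, d)
    decoder = nn.Linear(d + n_attr, 100)
    discriminator = nn.Linear(100, n_attr)

    x = torch.randn(4, 100)
    c = torch.eye(n_attr)[torch.randint(0, n_attr, (4,))]     # one-hot attribute codes

    z = encoder(x)                            # z = Encoder(x)
    x_hat = decoder(torch.cat([z, c], -1))    # x̂ = Decoder(z, c)
    c_pred = discriminator(x_hat)             # c = Dis(x̂)

    # The attribute loss encourages x̂ to carry the attribute c
    attr_loss = nn.functional.cross_entropy(c_pred, c.argmax(-1))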


  • Transformers

    Vaswani et al. (2017): “Attention mechanism . . . , allowing modeling of dependencies without regard to their distance in the input or output sequences.”

    Recent applications of transformers simply use them as basic building blocks, just as LSTMs are used.

  • Generative Pre-trained Transformers (GPT)

    GPT (Radford et al., 2018) is trained simply by predicting the next word with

    h_0 = W_e x_{t−k:t−1} + W_p
    h_l = transformer_block(h_{l−1})  ∀ l ∈ [1, 12]
    p(y_t | h_n) = softmax(W_o h_n)    (9)

    where W_e and W_p are the word and position embeddings.
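    A sketch of Eq. (9): token plus position embeddings pass through a stack of transformer blocks with a causal mask, and the last hidden state predicts the next word. PyTorch's TransformerEncoderLayer stands in for GPT's transformer block; layer count, sizes, and names are illustrative.

    import torch
    import torch.nn as nn

    vocab_size, d, n_layers, k = 1000, 64, 12, 16
    W_e = nn.Embedding(vocab_size, d)              # word embeddings
    W_p = nn.Embedding(k, d)                       # position embeddings
    blocks = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        for _ in range(n_layers)
    ])
    W_o = nn.Linear(d, vocab_size)

    x = torch.randint(0, vocab_size, (2, k))       # the context words y_{t-k:t-1}
    mask = nn.Transformer.generate_square_subsequent_mask(k)   # causal mask

    h = W_e(x) + W_p(torch.arange(k))              # h_0 = W_e x + W_p
    for block in blocks:
        h = block(h, src_mask=mask)                # h_l = transformer_block(h_{l-1})
    p_next = torch.softmax(W_o(h[:, -1]), dim=-1)  # p(y_t | h_n) = softmax(W_o h_n)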

  • Pre-trained Models as Teachers

    Besides generative models like GPT-2 and its variants, we can also use BERT for text generation. One example from (Chen et al., 2020b) is that BERT can be used as a teacher model to help train a sequence-to-sequence model for better output probabilities.

  • Structural Information

  • Roadmap

    Encoder-Decoder Framework → Structural Information: Neural Templates, Syntactic, Semantic

  • Generation with Neural Templates

    An extension of variational autoencoders is to incorporate sequential information in the latent variables. For example, Wiseman et al. (2018) propose a semi-Markov model on the latent variable sequence z = (z_1, ..., z_n) to capture dependencies among adjacent words for data-to-text generation.

    A neural hidden semi-Markov model decoder:

    ▶ Transition distribution p(z_{t+1} | z_t, x)
    ▶ Length distribution p(l_{t+1} | z_{t+1})
    ▶ Emission distribution p(y_{t−l_t+1:t} | z_t = k, l_t = l, x)

  • Graph-to-Sequence Generation

    An example input is an AMR graph, where the task is to generate a text with the same meaning as the input graph.

    ▶ The whole model is a layer-wise LSTM: within each layer the LSTM runs over the linearized graph, and between layers the hidden states are updated with neighboring nodes from the previous layer
    ▶ The idea of linearizing an AMR graph (using depth-first search) is proposed by Konstas et al. (2017); a toy sketch of such a linearization follows below
    ▶ Another idea for incorporating AMR graphs: graph neural networks (Chen et al., 2020a)
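    A toy sketch of depth-first linearization of a small graph, in the spirit of the AMR linearization used by Konstas et al. (2017); the graph, node names, and bracketing format are illustrative, not actual AMR notation.

    def linearize(graph, node, visited=None):
        visited = visited if visited is not None else set()
        visited.add(node)
        tokens = [node]
        for rel, child in graph.get(node, []):
            tokens.append(rel)
            if child in visited:
                tokens.append(child)          # re-entrant node: emit its name only
            else:
                tokens += ["("] + linearize(graph, child, visited) + [")"]
        return tokens

    # "want-01" has an ARG0 "boy" and an ARG1 "go-02" whose ARG0 is the same "boy"
    graph = {"want-01": [(":ARG0", "boy"), (":ARG1", "go-02")],
             "go-02": [(":ARG0", "boy")]}
    print(" ".join(linearize(graph, "want-01")))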


  • Dialogue Generation with Semantic Exemplars

    A straightforward application of GPT is the response generation approach proposed by Gupta et al. (2020). The prediction is still done word by word, where the input sequence and output sequence are¹

    x = (Dialogue context, Response frames, Response text)
    y = (〈MASK〉, Response frames, Response text)    (10)

    ¹ An example of a semantic frame, Perception: hear, say, see, smell, feel.
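    A toy sketch of how the sequences in Eq. (10) could be laid out for a GPT-style model: the dialogue-context segment is masked out in the target so the loss is only computed on the frames and the response. The separator tokens and this helper are illustrative, not the paper's exact format.

    MASK = "<MASK>"

    def build_example(context, frames, response):
        x = context + ["<frames>"] + frames + ["<response>"] + response
        y = [MASK] * len(context) + ["<frames>"] + frames + ["<response>"] + response
        return x, y

    x, y = build_example(
        context=["did", "you", "hear", "that", "?"],
        frames=["Perception"],
        response=["yes", ",", "I", "heard", "it", "too", "."],
    )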

  • Summary

    Encoder-Decoder Framework

    ▶ Contextual Information: Attention Mechanism, Copying Mechanism, Memory Modules
    ▶ Latent Representation: Variational Autoencoders, Generative Adversarial Networks, Disentangled Representations, Pretrained Language Models
    ▶ Structural Information: Neural Templates, Syntactic, Semantic


  • Reference I

    Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations. arXiv: 1409.0473.

    Bao, Y., Zhou, H., Huang, S., Li, L., Mou, L., Vechtomova, O., Dai, X.-y., and Chen, J. (2019). Generating Sentences from Disentangled Syntactic and Semantic Spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6008–6019, Florence, Italy. Association for Computational Linguistics.

    Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.

    Chen, M., Tang, Q., Wiseman, S., and Gimpel, K. (2019). Controllable Paraphrase Generation with a Syntactic Exemplar. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5972–5984, Florence, Italy. Association for Computational Linguistics.

    Chen, Y., Wu, L., and Zaki, M. J. (2020a). Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation. arXiv:1908.04942 [cs]. arXiv: 1908.04942.

    Chen, Y.-C., Gan, Z., Cheng, Y., Liu, J., and Liu, J. (2020b). Distilling Knowledge Learned in BERT for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7893–7905, Online. Association for Computational Linguistics.

    Cheng, P., Min, M. R., Shen, D., Malon, C., Zhang, Y., Li, Y., and Carin, L. (2020). Improving Disentangled Text Representation Learning with Information-Theoretic Guidance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7530–7541, Online. Association for Computational Linguistics.

    Clark, E., Ji, Y., and Smith, N. A. (2018). Neural Text Generation in Stories Using Entity Representations as Context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250–2260, New Orleans, Louisiana. Association for Computational Linguistics.

    Gu, J., Lu, Z., Li, H., and Li, V. O. (2016). Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

    Gupta, P., Bigham, J. P., Tsvetkov, Y., and Pavel, A. (2020). Controlling Dialogue Generation with Semantic Exemplars. arXiv:2008.09075 [cs]. arXiv: 2008.09075.

    Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E. P. (2017). Toward Controlled Generation of Text. Volume 70 of Proceedings of Machine Learning Research, pages 1587–1596, International Convention Centre, Sydney, Australia. PMLR.

  • Reference II

    Ji, Y., Tan, C., Martschat, S., Choi, Y., and Smith, N. A. (2017). Dynamic Entity Representations in Neural Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1830–1839, Copenhagen, Denmark. Association for Computational Linguistics.

    Kingma, D. P. and Welling, M. (2014). Auto-Encoding Variational Bayes. In ICLR. arXiv: 1312.6114.

    Konstas, I., Iyer, S., Yatskar, M., Choi, Y., and Zettlemoyer, L. (2017). Neural AMR: Sequence-to-Sequence Models for Parsing and Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 146–157, Vancouver, Canada. Association for Computational Linguistics.

    Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., and Jurafsky, D. (2017). Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.

    Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.

    Rush, A. M., Chopra, S., and Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.

    See, A., Liu, P. J., and Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

    Shen, X., Su, H., Li, Y., Li, W., Niu, S., Zhao, Y., Aizawa, A., and Long, G. (2017). A Conditional Variational Framework for Dialog Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 504–509, Vancouver, Canada. Association for Computational Linguistics.

    Sohn, K., Lee, H., and Yan, X. (2015). Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems, pages 3483–3491.

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is All you Need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

    Wiseman, S., Shieber, S., and Rush, A. (2018). Learning Neural Templates for Text Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3174–3187, Brussels, Belgium. Association for Computational Linguistics.
