Discriminator Networks Co-opNet arXiv:1907.01272v1 [cs.CL ... · Discriminator Networks (Co-opNet...

Co-opNet: Cooperative Generator–Discriminator Networksfor Abstractive Summarization with Narrative Flow

Saadia Gabriel♠ Antoine Bosselut♠♦ Ari Holtzman♠♦ Kyle Lo♦Asli Celikyilmaz ♣ Yejin Choi♠♦

♠Paul G. Allen School of Computer Science & Engineering, University of Washington♣Microsoft Research ♦Allen Institute for Artificial Intelligence

{skgabrie,antoineb,ahai,yejin}@cs.washington.edu{kyleclo}@uw.com {aslicel}@microsoft.com

AbstractWe introduce Cooperative Generator-Discriminator Networks (Co-opNet), ageneral framework for abstractive summariza-tion with distinct modeling of the narrativeflow in the output summary. Most currentapproaches to abstractive summarization, incontrast, are based on datasets whose targetsummaries are either a single sentence, or abag of standalone sentences (e.g., extractedhighlights of a story), neither of which allowsfor learning coherent narrative flow in theoutput summaries.

To promote research toward abstractive sum-marization with narrative flow, we first intro-duce a new dataset, Scientific Abstract Sum-marieS (SASS), where the abstracts are usedas proxy gold summaries for scientific arti-cles. We then propose Co-opNet, a noveltransformer-based framework where the gen-erator works with the discourse discriminatorto compose a long-form summary. Empiricalresults demonstrate that Co-opNet learns tosummarize with considerably improved globalcoherence compared to competitive baselines.

1 Introduction

We study the task of generating abstractive sum-marization with narrative flow: given an in-put document, the distinct goal is to generate aparagraph-length abstractive summary that has aproper narrative flow. Our study contrasts withmost previous work that focused on either ex-tractive document-level summarization (Nenkovaand McKeown, 2012; Allahyari et al., 2017) orabstractive sentence-level summarization (Rushet al., 2015; Grusky et al., 2019; Narayan et al.,2018a), where maintaining a good narrative flowin the output summary was not within the scope ofthe task definition.

Work-in-progress manuscript

Discourse relation characterizes the internal structureand logical relation of a coherent text.

Automatically identifying these relations not only playsan important role in discourse comprehension andgeneration, but also obtains wide applications inmany other relevant natural language processingtasks, such as text summarization (Yoshida et al.,

2014), conversation (Higashinaka et al., 2014),question answering (Verberne et al., 2007) and

information extraction (Cimiano et al., 2005).

Generally, discourse relations can be divided into twocategories: explicit and implicit, which can be

illustrated in the following example: The company wasdisappointed by the ruling . . .

Implicit discourse relation recognition is a crucialcomponent for automatic discourselevel analysis and

nature language understanding.

In this paper, instead, we explore generative modelsand propose a variational neural discourse relation

recognizer.

Introduction

Abstract

Inter-sentence narrative flow

Introductory statements:"In this paper" , "In this work" , "In the present study"

Body of intro, citations

More detailed description of task

Introduction of task

SummaryGeneration with

Discourse-AwareDiscriminator

Discourse-aware transformer architecture

Figure 1: Structure for introduction → abstract scien-tific paper generation.

Our study contrasts also with recent work to-wards abstractive summarization at the document-level. Importantly, most such studies could notdirectly model or evaluate abstractive summariza-tion with narrative flow in the output summary.This is largely due to the inherent limitations of theexisting datasets; the reference summaries avail-able in most commonly used large-scale datasets,such as CNN/DailyMail dataset (Hermann et al.,2015), are mainly the headlines of the newsar-ticles or stories, which are often sets of discon-nected sentences. These summaries neither pro-vide the inductive bias for models to learn the de-sired narrative flow in output summaries, nor en-able researchers to measure the quality of narrativeflow using reference summaries (Chen and Bansal,

arX

iv:1

907.

0127

2v1

[cs

.CL

] 2

Jul

201

9

2018). This lack of proper inductive bias also im-plies that the learned models often exhibit extrac-tive tendency. They do not learn the abstractivegeneration capability necessary to achieve coher-ent narrative flow in the output summary (Hoanget al., 2019).

Accordingly, we present a new large-scaledataset, Scientific Abstract SummarieS (SASS), topromote research toward abstractive summariza-tion with narrative flow. SASS provides over700k samples to support three distinct levels ofabstractive summarization formulations: (1) intro-to-abstract, (2) abstract-to-title, and (3) intro-to-title. The first task formulation, intro-to-abstract,supports unique opportunities to study summarieswith narrative flow, as abstracts in scientific pa-pers are structured with highly coherent discourseflow. In addition, scientific paper abstracts main-tain loose, therefore abstractive alignments withrespect to the introduction (Figure 1), which ischallenging for current models to learn to abstractrather than extract.

We then introduce Cooperative Generator-Discriminator Networks (Co-opNet), a newmodeling framework for abstractive summariza-tion with distinct modeling of the narrative flow inthe output summary. In this framework, the gen-erator, based on transformer language models thatare fine-tuned for abstractive summarization, pro-poses a pool of candidate summaries. Then, thediscriminator, also based on transformers, selectsthe summary that has the best narrative flow acrossadjacent sentences. Our work presents the firststudy that adapts the learning-to-write frameworkof Holtzman et al. (2018), originally proposed foropen-ended text generation, to abstractive summa-rization, along with comprehensive performancereports on the new SASS dataset.

Comprehensive empirical results demonstratethat Co-opNet learns to summarize with a con-siderably improved global coherence compared tocompetitive baselines. Based on the recently pro-posed BERTScore (Zhang et al., 2019a), in partic-ular, our model outperforms all other models by3.98 points on BERTScore-R. In addition, humanjudgments demonstrate that domain experts preferCo-opNet over the base summarization model inover 64% of cases.

The rest of the paper is organized as follows.We first describe the generator network and thediscriminator network in §2 and §3 respectively.

We then describe how we combine the two intoa cooperative generator-discriminator network in§4. We introduce our new dataset in §5 followedby an empirical study and analysis in §6 and §7.We discuss related work in §8 and conclude in §9.

2 Generator Networks

We use the transformer architecture of Radfordet al. (2019) as our generator’s architecture. Fol-lowing the work of Liu et al. (2018), we adapt alanguage model to the task of abstractive summa-rization by concatenating the article a, a delimitertoken d, and the summary s into one fixed-lengthinput x = (a1, ..., a|a|,d, s1, ..., s|s|), where |a| isthe length of the gold article and |s| is the lengthof the gold summary.

The transformer has the same block architectureas the model of Radford et al. (2019) and consistsof L attention blocks consisting of self-attentionand feed-forward layers. At each time step i, themodel produces an output probability distributionover the vocabulary for the next token wi given allprevious output tokens w<i. For any arbitrary to-ken wj preceding wi, the per-layer representationof that token is computed in the following way:

h0i =We(wj) + pj (1)

hlj = block({h}l−1≤j ) (2)

where We is a word embedding matrix, pj is theposition embedding, h0i is the initial representa-tion, and {h}l−1≤j is the set of all preceding layerblock outputs for tokens up to wj . Finally, for thecurrent position i in the sequence:

P (wi) = log(softmax(hLi We)) (3)

where We is the same word embedding matrix asin Equation 1 and hLi is the final layer transformerblock output at position i.

The model is trained to minimize the negativeloglikelihood of the next word wi given all pre-ceding words:

Lgen = −|a|+|s|∑j=1

log p(wi|w0, ...wi−1) (4)

where wi is the ith token of x. At test time, xonly consists of the gold article and delimiter to-ken (a1, ..., a|a|,d) and we decode generated sum-maries g starting from this input.

Generationpool ofhypothesissummaries

S1 S2 S3 Sj. . .

g1 gn. . .

. . .

generatedtokens

generatedtokens Summarization

ModelProbabilitiesp1, p2, ... , pn

S1 Sentence Pairs

S1 . . .

S2

Sj-1

Si-1. . .

S1 S2 S3 Si. . .

Si

S2 Sj

TransformerPairwiseAdjacencyDiscriminator

Adjacency Scoresa1, a2, ... , anFinal

predictedsummary Reranking

Transformer Summarization

Architecture

InputContext +Pos Embed

. . .

. . .

Source Document

Figure 2: Model Architecture

3 Discriminator Networks

Because the autoregressive nature of the generatormakes it unlikely to achieve narrative flow acrosslong time horizons, we incorporate a discriminatormodel into the decoding process. Due to the diffi-culty of explicitly defining discourse properties todiscriminate between generations with good andbad narrative flow, we rely on a parametrized scor-ing function to approximate this discourse prop-erty by scoring whether pairs of adjacent sentencesare consistent with one another.

3.1 Discriminator Architecture

Sentence Pair Representation To model thelikelihood of adjacency between two sentences suand sv of length |u| and |v| respectively, we firstcompute a hidden representation of the sentencepair using BERT (Devlin et al., 2019). This rep-resentation allows us to better capture the fine-grained contextual information necessary for un-derstanding the relationship between su and sv.

The initial input to the encoder is the concate-nation of these sentences: s = [CLS] + su +[SEP ] + sv + [SEP ], where [CLS] is a spe-cial token associated with the task and [SEP ] isa sentence delimiter token. As in Devlin et al.(2019), each word in the sequence is encoded bya word embedding wi and positional embedding

pi that identifies its position in the s. Addition-ally, we have a learned segmentation embeddingqi for each token in s to indicate which of the twosentences, su or sv, the word belongs to. There-fore, our full input to each of the encoder layers isE = (e1, ...., e|u|+|v|+3), where ei = wi + pi + qi.We encode these representations using BERT (De-vlin et al., 2019) and the final contextual represen-tation used for adjacency prediction is the pooledhidden representations of the [CLS] token, whichwe denote as hcls.

Adjacency Classification Once the sentencepair s=(su,sv) has been encoded as the hidden rep-resentation hcls, we obtain the probability of ad-jacency between the pair of sentences (su and sv)by a linear projection (Wdisc) followed by a log-softmax:

Padj(s) = softmax(Wdischcls) (5)

In § 4, we describe how the adjacency scores areused to re-rank candidate summaries.

3.2 Discriminator Training

Sentence Selection for Discriminator ModelsTo train a discriminator model, we use a subsetof adversarial and positive sentence pair exam-ples extracted from the training set. The sentence

pairs are extracted from gold abstracts contain-ing at least five sentences using the following ap-proach: For a randomly selected sentence su fromthe abstract, we randomly select an adjacent sen-tence, su−1 or su+1, for a positive example andany nonadjacent sentence sv/∈[u−1,u+1] for a nega-tive example.

Adjacency Learning Objective We define thetraining objective for the adjacency discriminatorto minimize the negative loglikelihood of predict-ing whether two sentences are adjacent or not:

Ldisc = −1

N

N∑(ADJ(s) · log(Padj(s))

+ (1− ADJ(s)) · log(1− Padj(s)))

(6)

where ADJ(s) is an indicator function for whetherthe two sentences in s are adjacent and Padj(s)is the discriminator’s estimate of adjacency fromEquation 5.

4 Cooperative Generation

The discriminator reranks candidate summarygenerations based on the overall likelihood of sen-tences within the summary being adjacent. Morespecifically, for each candidate generated sum-mary g and sentence pair s = (su, sv) contained ing, we define the likelihood of sentence adjacencyas the score that the discriminator assigns to thatsentence pair (Equation 5). By defining discourseusing this approach, we place minimal restrictionson the model’s ability to hone in on patterns per-taining to desirable narrative flow.

To incorporate this objective into our summa-rization framework, we use a modified decodingobjective function. First, we generate a pool ofcandidate summaries from the base summarizationmodel (§2) using an arbitrary decoding strategy(e.g., beam search, nucleus sampling, top-k sam-pling). Then, the discriminator is used to re-rankthese candidates. Specifically, for each generatedsentence su present in a candidate summary g oflength |g| tokens with S sentences, we maximizeboth the language model probability of each tokenp(wi|w<i) and the probability p(su, su−1) of subeing adjacent to the previous sentence su−1:

p(g) = δgen

|g|∑i=1

p(wi|w1, ...wi−1)

+ δdisc

S∑u=1

Padj(su, su−1) (7)

where δgen and δdisc are hyper-parameters con-trolling the contribution of the generator and dis-course discriminator to the final predicted sum-mary generation. The score Padj(su, su−1) is theadjacency score computed by the discourse dis-criminator model (Equation 5).

5 Data

In the following section, we introduce SASS, adataset of over 700K introduction-abstract pairsfrom arXiv 1. We also describe AAN, an existingdataset of NLP papers published at top venues2.These datasets rely on greater discourse structureand abstractiveness than previous summarizationcorpora. We specifically choose to focus on scien-tific papers incorporating a wide range of domainknowledge and subjects to rigorously test the gen-eralization of our models across different fields ofscientific endeavour.

5.1 Datasets

SASS The dataset is crawled from arxiv.organd contains over 700k introduction-abstract pairsfrom scientific articles. In our experiments we pri-marily focus on the CS3 and Bio4 domain sub-sets. The task in SASS presents a challenge toexisting summarization models, since it requiresmodels to learn relevant domain knowledge for thescientific domain of interest, as well as recognizecommon discourse structure for papers written inthat domain. While the work in this paper focusesmainly on Introduction → Abstract summariza-tion, we also note that we extracted text for twoother summarization tasks (Introduction → Title,and Abstract→ Title) and provide results for thoseAppendix A.

AAN In addition to the SASS dataset, we in-clude an existing dataset of scientific articles thatfocuses on papers in the NLP computer science

1https://arxiv.org2https://www.aclweb.org/anthology3https://arxiv.org/corr4https://arxiv.org/archive/q-bio

arxiv.org

https://arxiv.org

https://www.aclweb.org/anthology

https://arxiv.org/corr

https://arxiv.org/archive/q-bio

Dataset Narrative Flow ? # Summaries Avg # Sents Avg # Words

XSum (Narayan et al., 2018a) 7 226,711 1.00 23.26Newsroom (Grusky et al., 2019) 7 1,321,995 1.45 26.70CNN (Hermann et al., 2015) 7 92,579 3.59 45.70DailyMail(Hermann et al., 2015) 7 219,506 3.86 54.65SASS 3 472,493 6.11 150.85AAN 3 11,890 5.03 106.76

Table 1: Statistics of gold summaries in different summarization datasets.

domain. This dataset consists of 12k paper subsetfrom the ACL Anthology Network (AAN) (Radevet al., 2009), found after removing articles fromthe anthology without abstracts and duplicates. Aswith SASS, we define the task for AAN as gener-ation of abstracts from introductions.

5.2 Narrative Flow Analysis

Since the focus of this work is on generating sum-maries with more coherent narrative flow, we con-centrate on datasets requiring narrative structure togenerate good summaries. Particular attributes ofthese datasets that connect to discourse structureare:

• Length of summaries → Are the summarieslong enough to clearly show narrative flowproperties?

• Abstractiveness of gold summaries→ Do thesummaries exhibit particular sentence-levelflow, or are the summary sentences extractedhighlights from the context?

As can be seen in Table 1, SASS and AANhave properties missing from existing summariza-tion datasets based on Newswire data. XSum(Narayan et al., 2018a) and Newsroom (Gruskyet al., 2019) are generally too short for the av-erage summary to exhibit cross-sentence narra-tive flow. Meanwhile, CNN/DailyMail (Hermannet al., 2015) summaries are acquired by concate-nating extracted highlights, which can be unre-lated, making it an unsuitable dataset for cap-turing narrative flow of summaries. Conversely,SASS and AAN, which use scientific abstracts assummaries, are both long enough to have multiplesentences, and generally exhibit strong discoursepatterns typical to scientific writing, making themideal testbeds for narrative flow quality in abstrac-tive summarization.

6 Experimental Setup

In this section we outline comparison baselinesand describe experimental setups for our genera-tor and discriminator models.

6.1 BaselinesWe train a 2-layer bi-LSTM sequence-to-sequencemodel with attention. The bi-LSTM is used to en-code a given source article a and a separate de-coder LSTM produces the generated summary g.At each decoding time step, the decoder attendsto all the context vectors produced by the encoderas well as the maintained state from the previousdecoder tokens to produce the next token in thesummary. We also implement a Pointer-Generator(PGEN + Coverage) model (See et al., 2017) thatextends the base LSTM model (LSTM + Cover-age) to allow tokens to be copied from the inputduring generation. Baselines are trained for up to40000 steps with a batch size of 16. Followingprevious work, we decode from these baselines us-ing beam search with a beam size of 4. We seta maximum decoding length of 200 tokens. TheRNN baselines additionally have a minimum de-coding length of 35 tokens as in See et al. (2017).

6.2 Generator ModelInput We perform word and sentence-level tok-enization using spaCy and NLTK (Loper and Bird,2002). Because of the fixed input size introducedby the transformer language model (Radford et al.,2019), the input context is truncated to a maxi-mum of 800 tokens and summaries are truncatedto a maximum of 200 tokens.Implementation All our models are implementedin Pytorch. Our code is adapted from the Ope-nAI Huggingface implementation of the 117M pa-rameter GPT-2 language model5and uses the pre-

5https://github.com/huggingface/pytorch-pretrained-BERT

trained weights of Radford et al. (2019).Training We use a learning rate of 2e-5 for fine-tuning. We use a batch size of 1 with gradientaccumulation to simulate a batch size of 16. Onthe AAN subset, we train the base summariza-tion transformer model for 11 epochs. On theSASS CS and Bio subsets, we train the base sum-marization transformer model for 8 epochs. Allexperiments are run on a Titan-X GPU. Train-ing time for the AAN and SASS Bio datasets isabout 30 minutes per epoch. Training time for theSASS CS dataset is 2.5 hours per epoch.

6.3 Discriminator Model

Input We use a max sentence length of 200 to-kens to accommodate the fixed input size of BERT(512 tokens), reduce inference time, and discour-age the model from generating abnormally longrun-on sentences that indicate the presence of co-herence issues.Implementation The discriminator models areadapted from the Huggingface implementationof the BERT next sentence prediction classifier6.We initialize the 12-layer BERT-base model withthe pretrained weights of the SciBERT-uncasedmodel, which was originally trained on 1.14 mil-lion scientific papers (Beltagy et al., 2019).Training We fine-tune the discriminator using alearning rate of 2e-5 and batch size of 2. We traintwo discriminators: one is trained on AAN for de-coding both SASS CS and AAN, while the otherdiscriminator is trained on SASS Bio and used ex-clusively for decoding that subset. All discrimina-tor models are trained for 17 epochs on a Titan-XGPU over a single day.

6.4 Generation Hyperparameters

During inference time, we use top-k samplingwith k=4 (Fan et al., 2018b) to generate 10 candi-date summaries for each model. In the re-rankingobjective (Equation 7), we weigh the generationand discriminator models equally for all experi-ments by setting δgen = δdisc = .5 and select thecandidate out of the 10 that achieves the highestjoint score in Equation 7 to be the final summary.We filter candidate summaries from the hypothesisgeneration pool that contain sentences longer thana fixed max length of 200 tokens, a clear sign ofcoherence deterioration.

6https://github.com/huggingface/pytorch-pretrained-BERT

Criteria % Co-opNet % Base % No Preference

Flow 64.2 26.4 9.4Relevance 45.3 35.8 18.9

Overall 60.3 34.0 5.7

Table 2: Expert Human Evaluation Results. The basemodel is the transformer model that does not use thediscriminator to evaluate its generations. Our model isthe transformer model that uses discriminator.

7 Experiments

Recent work in abstractive summarization has re-ported issues with reference-based automatic met-rics such as ROUGE (Hoang et al., 2019; Sunet al., 2019; Clark et al., 2019; Kilickaya et al.,2017). Due to these concerns, we first discuss hu-man evaluation in §7.1 before reporting automaticmetrics in §7.2. Finally, in §7.3 we explore model-based alternatives to traditional n-gram based met-rics.

7.1 Human EvaluationSince coherence of generated text is difficult tomeasure automatically, we conduct human evalua-tions to evaluate how the discriminator affects gen-eration quality. We randomly sample 34 abstractsfrom the AAN test set of NLP scientific papers.Then, we present a side-by-side blind comparisonof the Base-generated version of the abstract andCo-opNet-generated version along with the goldintroduction context to 1 of 10 in-domain expertswhose experience in the field of NLP ranges from3 to 10 years. To reduce bias, the ordering of gen-erated abstracts is also randomized and experts arenot told that the abstracts are machine-generated.Each pairwise comparison is evaluated by threeunique experts. The experts are asked to assessgeneration quality based on three key criteria:

• Flow → Which abstract does a better job ofpresenting a coherent summary that displayscorrect discourse properties?

• Relevance → Which abstract does a betterjob of summarizing the main ideas presentedin the gold introduction?

• Overall→Which abstract is better overall?

Each expert casts a vote for which abstract is pre-ferred on each of these criteria or “No Preference,"if there is no distinguishable difference betweenthe abstracts based on a given criteria.

https://github.com/huggingface/pytorch-pretrained-BERT

https://github.com/huggingface/pytorch-pretrained-BERT

Model R-1 R-2 R-L

Lede-3 27.12 6.62 23.88LSTM + Coverage 27.80 5.57 18.02PGen + Coverage (See et al., 2017) 39.85 12.83 23.24Base (topk, k = 4) 33.66 9.37 29.95Co-opNet (topk, k = 4) 39.66 10.69 35.47

Table 3: ROUGE scores for generating abstracts forscientific paper introductions for the AAN dataset

Model R-1 R-2 R-L

Lede-3 28.22 7.06 16.22LSTM + Coverage 22.74 4.56 20.64PGen + Coverage (See et al., 2017) 36.68 11.74 32.55Base (topk,k=4) 37.32 9.62 33.74Co-opNet (topk, k=4) 38.92 10.19 35.45

Table 4: ROUGE scores for generating abstracts forscientific paper introductions for the CS portion of theSASS dataset

Results The results of this expert human eval-uation (shown in Table 2) show that Co-

opNet clearly improves the quality of generatedabstracts over the Base transformer model. Inparticular, Co-opNet is selected as best in 64.2% of cases for the Flow criteria, while the rele-vance score is less clearly superior, indicating thatimproved discourse is the primary reason for theOverall preference.

7.2 Traditional Automatic Evaluations

To match previous work on summarization, weuse the ROUGE metric (Lin, 2004) for auto-matic evaluation Specifically, we report ROUGE-1, ROUGE-2 and ROUGE-L F-1 scores. Table3 shows the results of our experiments on theAAN dataset. The results of experiments on theSASS CS and Bio subsets are also shown in Ta-bles 4 and 5, respectively. Notably, the modelusing the discriminator (Co-opNet) outperformsthe generator-only (Base) model across ROUGEmetrics on all datasets except for ROUGE-2 on theBio and CS SASS datasets. Co-opNet also out-performs the baseline models on the SASS sub-sets. On the more domain-specific AAN subset,results are mixed, as the PGen + Coverage base-line (See et al., 2017) achieves higher performanceon ROUGE-1 and ROUGE-2, but our model isover 12% better on ROUGE-L, which is gener-ally considered the best summarization metric.Our hypothesis is that performance gains achievedwith the inclusion of the SciBERT discriminator

Model R-1 R-2 R-L

Lede-3 27.60 5.70 24.21LSTM + Coverage 10.73 0.49 9.94PGen + Coverage (See et al., 2017) 23.74 4.48 21.65Base (topk,k=4) 36.30 7.99 32.87Co-opNet (topk, k=4) 36.42 7.98 33.18

Table 5: ROUGE scores for generating abstracts forscientific paper introductions for the Bio portion of theSASS dataset

Model R-L P R-L R R-L F1

Base (topk,k=4) 78.41 9.67 16.71Co-opNet (topk, k=4) 75.54 16.61 26.22

Gold Abstract 68.03 17.48 26.87

Table 6: ROUGE-L comparison for generated andgold AAN abstracts compared to introductions. TheROUGE-L precision measures the average extractive-ness of the abstracts.

are caused by three main factors:

Flow Since SciBERT was fine-tuned on a largecorpus of scientific papers, the model introducesexternal contextual information about scientificdocuments that enhances the ability of the summa-rization model to distinguish between “good” and“bad” narrative flow for in-domain text. Addition-ally, Table 6 shows that abstracts generated fromCo-opNet achieve a ROUGE-L precision scorethat is 2.87 points lower than the score for ab-stracts generated from the base model when usingthe introduction as a reference. This indicates us-ing a discriminator leads to more abstractive sum-maries than the base model, as less of the summarycan be pulled in the same order from the document(Hoang et al., 2019).

Content The discriminator encourages selectionof more contentful generations that include salientkeyphrases and domain-specific information. Ta-ble 6 shows that abstracts generated from the basemodel achieve a ROUGE-L recall score that is6.94 points lower than the score for abstracts gen-erated from Co-opNet. This indicates that Co-opNet can pull more relevant content from theintroduction compared to the base model.

Repetition The adjacency task modeled by thediscriminator assigns a high likelihood of choos-ing strongly correlated sentences that follow nat-urally from each other, but are not exact copies

Model BERT-P BERT-R BERT-F1 SciBERT-P SciBERT-R SciBERT-F1

PGen + Coverage (See et al., 2017) 60.44 55.88 57.86 61.15 57.53 59.13Base (topk,k=4) 63.17 52.25 57.00 65.27 56.08 60.20Co-opNet (topk, k=4) 60.66 56.23 58.21 62.03 58.51 60.11

Table 7: BERTScore results for AAN subset using both BERT and SciBERT as the evaluation model.

or paraphrases of one another. This can be at-tributed to the fact that adjacent sentences tendto contain related information instead of repeatinginformation, reducing the overall repetitiveness ofdiscriminator model generations.

7.3 Alternatives to ROUGEDespite our generally superior results on theROUGE metric, past work (Schluter, 2017; Hoanget al., 2019; Sun et al., 2019) has explored the lim-itations of ROUGE for evaluating summarizationtasks, including the issue that ROUGE measuresn-gram overlap without distinguishing betweensalient n-grams and non-contentful tokens. Thisfailure means that models with a higher probabil-ity of generating generic, frequent terms such as“the” and “how” can potentially outperform mod-els that better capture conceptual information, butmay paraphrase rather than extract common terms.

To overcome these limitations, we explore us-ing contextualized evaluation to measure how wellmodels learn to extract conceptual meaning fromscientific paper introductions and generate high-quality abstracts. The BERTScore metric (Zhanget al., 2019a), which has been shown to be wellcorrelated with human judgments on other NLPtasks, is one such metric that measures similaritybetween contextual embeddings of the generatedsummary and reference summary. For a generatedand gold abstract, we first convert the sequencesof tokens into sequences of contextual vector rep-resentations using either the uncased version ofBERT-base or the SciBERT variant. Following(Zhang et al., 2019a), we then compute precision,recall and F1 scores from these contextual repre-sentations.

Results Our results on the BERTScore evalua-tion reinforce our observations from the humanand ROUGE score evaluations. Co-opNet gen-erally outperforms the PGen + Coverage base-line across all metrics, regardless of the evalua-tion model used (BERT, SciBERT). eResults aremore mixed when comparing Co-opNet to thebase model with no discriminator. Overall, Co-

opNet is less precise than the Base model at cap-turing contextual meaning, potentially due to thefact that the Co-opNet tends to produce longersummaries (86.4 tokens vs. 49.11 tokens forBase). This observation is supported by the factthat Co-opNet achieves the best results in termsof recall, beating the Base model by 3.98 points onBERT-R and 2.43 points on SciBERT-R.

8 How does the discriminator improvecoherence?

Abstracts generated from the non-discriminatormodels tend to lack completeness. While gener-ations from these models are adept at introducingthe task presented in the scientific paper, they donot provide a full summary of the paper’s contents.Generations from the Base and PGen + Coveragelack details about final results and end abruptly in-stead of coming to a natural conclusion. As shownby Figure 3, the Base and PGen + Coverage gen-erations are also often over-specific (for example,mentioning the CRAFT corpus without specifyingthat it is a set of published scientific articles). Incontrast, Co-opNet-generated summaries have aclear narrative flow that capture each key part ofa coherent abstract: 1) Introduction of the task,2) Methods used to address this problem, and 3)Main findings and results. Despite overcomingthese limitations to modeling discourse structureand coherence, our analysis reveals two key typesof errors in Co-opNet generations:Semantic Repetition Despite improvements in re-ducing repetition over baselines, Figure 3 providesone example of how the model repeats informationwithin the same sentence without exact copying –generating “blogs", then “blog posts."Veracity Co-opNet also hallucinates inaccurateinformation like the nonexistent term “bi-diag sec-tions," or erroneous acronyms like “(TEU)." Thisindicates the model could still benefit from moreexternal domain knowledge for grounding.These findings indicate that the candidate sum-maries selected by the discriminator still pose aninteresting challenge for future research on im-

Vagueness

Veracity

Repetition

No Discriminator With Discriminator

In this paper, we describe the rhetorical roles ofsentences in the craft corpus, a set of full-text papersthat we have annotated using the cisp schema.

Hand alignment of the resulting annotationssuggests that patterns in these cisp-annotatedsentences correspond to common argumentativegambits in scientific writing.

Base Transformer Model

Pointer-Generator + Coverage Model

Transformer + DiscriminatorIn this paper, we analyze the role of rhetoricalrepresentations in taxonomic parsing of text.

In the CRAFT corpus, sentence distributional modelswith annotated data consistently outperform modelswith no annotated data.

We present an approach forsystematically annotating, in a corpus offull-text published scientific articles,rhetorical roles of sentences in scientificwriting....

These hierarchical approaches can bethought of as hierarchical output (TEU)annotation techniques...

Specifically, in this paper, we proposemethods for generating dependencyrelations with bi-directional LFG.

The empirical evaluation of our methodsshows that they can be applied to partsof the scientific literature such asabstracts, bi-diag sections,blogs, and blog posts.

Figure 3: Analysis of common errors and coherence issues in generations of scientific abstracts.

proving quality of generated scientific summaries.

9 Related Work

Generation with Narrative Flow Due to the needfor accurate understanding of long-distance de-pendencies and narrative structure, modeling co-herent narrative flow has proved to be a majorchallenge in the field of text generation, particu-larly for scientific documents (Koncel-Kedziorskiet al., 2019; Nikolov et al., 2018). A numberof solutions have been proposed in recent yearsfor improving coherence in text generation, in-cluding global-tracking of entities and discourse-aware neural rewards (Kiddon et al., 2016; Bosse-lut et al., 2018; Holtzman et al., 2018; Fanet al., 2018b). In particular, the work of Cohanet al. incorporates narrative structure into Pointer-Generator networks (See et al., 2017) by using adiscourse-aware attention mechanism for abstrac-tive summarization of scientific papers. We ex-pand upon this previous work, using the globalcontext provided by our Transformer-based Coop-erative Generator-Discriminators to better capturefar-distant information useful for learning “good"narrative flow.Neural Abstractive Summarization In the past,abstractive summarization models (Rush et al.,2015; Chopra et al., 2016; Gehrmann et al., 2018)have relied upon a seq2seq encoder-decoder ar-chitecture that follows the generation frameworkof (Sutskever et al., 2014). Recently, new chal-lenges in abstractive summarization such as topic-

aware and controllable summarization, have en-couraged exploration of other model architectureslike convolutional neural networks (CNNs) (Alla-manis et al., 2016; Fan et al., 2018a; Narayan et al.,2018b) as an alternative to RNNs. Reinforcement-learning based approaches (Celikyilmaz et al.,2018; Chen and Bansal, 2018) have also enhancedthe overall quality and conciseness of abstractivesummaries. In addition to CNNs, Transformermodels have emerged as a promising architecturefor text generation and achieved state-of-the-artresults across several NLP tasks (Vaswani et al.,2017; Radford et al., 2019). Liu et al. proposeda decoder-only transformer model for summariza-tion across multiple Wikipedia articles and sev-eral later works have explored Transformer-basedarchitectures for abstractive summarization overNewswire (Khandelwal et al., 2019; Zhang et al.,2019b; Hoang et al., 2019). While our modelbuilds upon this work, to our knowledge, it is thefirst Transformer summarization model to explic-itly model narrative flow in text.

10 Conclusion

In this work, we introduced CooperativeGenerator-Discriminator Networks, a frameworkfor more coherent natural language generationwith transformer language models through theintegration of a discriminator that can discerngood narrative flow. We showed the limitationsof current automatic metrics for evaluatingsummarization at capturing improvements in

Gold This research is concerned with making recommendations to museum visitors based on theirhistory within the physical environment, and textual information associated with each item intheir history. (...) This study compares and analyses different methods of path predictionincluding an adapted naive Bayes method, document similarity, visitor feedback and measuresof lexical similarity.

PGen + Coverage In this paper we study the problem of recommendation systems for the task of visitors toan information rich environment such as a museum, are interested to convey predictions aboutwhat theme or topic a given visitor is interested in. We are interested to get the most out of theirvisit. As a behaviour to evaluate the relative impact of different factors in the physical space.(...)

Base This paper proposes a novel approach to measuring the success of machine learning methodsin a user’s selection of a particular exhibit to be produced. An unsupervised framework is usedto jointly compute the likelihood of the value of the best exhibit to be produced. (...) Theexperiments show that models produced by supervised methods improve user performance inselecting exhibits over unsupervised methods.

Co-opNet This paper presents a user-centric perspective on the property of location, focusing on somerelevant factors in deciding which exhibit a user intends to visit. We exploit variation andinfrequency in data from the Victoria/Ausland Courtyard (VCA) Collection (...) We make threecontributions: (1) Our experimental system provides empirical evidence for the effectivenessof supervised learning techniques in predicting a user’s audience behavior; (2) Our structurebased method allows unsupervised learning to be applied to multiple sets of related information.(3) Our experimental system uses unsupervised model adaptation in a supervised setting.

Table 8: Example of gold and generated abstracts from baseline Pointer Networks + Coverage (See et al., 2017)(PGen + Coverage) and two of our proposed models, Base (topk,k=4) and Co-opNet (topk, k=4), on the NLPscientific domain. Coherency issues in the PGen + Coverage generated abstract, including incorrectly structuredsentences and lack of concluding details, are highlighted in red. We highlight transitional phrases that contributeto coherent flow by properly delineating sections of abstracts in purple.

coherence, and proposed a new evaluation setupfor summarization that takes into account con-textual similarities between summaries. Throughthese analyses and eliciting human judgments, weempirically show that the discriminator model isbetter at selecting generations that are both morerelevant and more narratively coherent.

Acknowledgments

We thank Lianhui (Karen) Qin, Jungo Kasai, RikKoncel-Kedziorski, Jan Buys and Rowan Zellersfor helpful discussions. We also thank the do-main experts who contributed to the expert hu-man evaluations in this work. This research wassupported in part by NSF (IIS-1524371), DARPACwC through ARO (W911NF-15-1-0543), andSamsung AI Research.

ReferencesMehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi,

Saeid Safaei, Elizabeth D Trippe, Juan B Gutier-rez, and Krys Kochut. 2017. Text summariza-tion techniques: a brief survey. arXiv preprintarXiv:1707.02268.

Miltiadis Allamanis, Hao Peng, and Charles Sutton.2016. A convolutional attention network for ex-treme summarization of source code. In Inter-

national Conference on Machine Learning, pages2091–2100.

Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. Scibert:Pretrained contextualized embeddings for scientifictext. ArXiv, abs/1903.10676.

Antoine Bosselut, Asli Çelikyilmaz, Xiaodong He,Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018.Discourse-aware neural rewards for coherent textgeneration. In NAACL-HLT.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, andYejin Choi. 2018. Deep communicating agents forabstractive summarization. In NAACL-HLT.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstrac-tive summarization with reinforce-selected sentencerewriting. In ACL.

Sumit Chopra, Michael Auli, and Alexander M. Rush.2016. Abstractive sentence summarization with at-tentive recurrent neural networks. In HLT-NAACL.

Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith.2019. Sentence moverâAZs similarity: Automaticevaluation for multi-sentence texts. In Proceed-ings of the 57th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers).

Arman Cohan, Franck Dernoncourt, Doo Soon Kim,Trung Bui, Seokhwan Kim, Walter Chang, and NazliGoharian. 2018. A discourse-aware attention modelfor abstractive summarization of long documents.arXiv preprint arXiv:1804.05685.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. Bert: Pre-training of deepbidirectional transformers for language understand-ing. In NAACL.

Angela Fan, David Grangier, and Michael Auli. 2018a.Controllable abstractive summarization. ACL 2018,page 45.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018b.Hierarchical neural story generation. In ACL.

Jonas Gehring, Michael Auli, David Grangier, DenisYarats, and Yann N Dauphin. 2017. ConvolutionalSequence to Sequence Learning. In Proc. of ICML.

Sebastian Gehrmann, Yuntian Deng, and Alexander M.Rush. 2018. Bottom-up abstractive summarization.ArXiv, abs/1808.10792.

Max Grusky, Mor Naaman, and Yoav Artzi. 2019.Newsroom: A dataset of 1.3 million summaries withdiverse extractive strategies. In NAACL.

Karl Moritz Hermann, Tomas Kocisky, EdwardGrefenstette, Lasse Espeholt, Will Kay, Mustafa Su-leyman, and Phil Blunsom. 2015. Teaching ma-chines to read and comprehend. In Advances in Neu-ral Information Processing Systems, pages 1693–1701.

Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz,and Yejin Choi. 2019. Efficient adaptation of pre-trained transformers for abstractive summarization.ArXiv, abs/1906.00138.

Ari Holtzman, Jan Buys, Maxwell Forbes, AntoineBosselut, David Golub, and Yejin Choi. 2018.Learning to write with cooperative discriminators.In Proceedings of the 56th Annual Meeting of theAssociation for Computational Linguistics (Volume1: Long Papers), pages 1638–1649.

Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, andLukasz Kaiser. 2019. Sample efficient text sum-marization using a single pre-trained transformer.ArXiv, abs/1905.08836.

Chloé Kiddon, Luke S. Zettlemoyer, and Yejin Choi.2016. Globally coherent text generation with neuralchecklist models. In EMNLP.

Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis,and Erkut Erdem. 2017. Re-evaluating automaticmetrics for image captioning. In Proceedings of the15th Conference of the European Chapter of the As-sociation for Computational Linguistics: Volume 1,Long Papers, pages 199–209.

Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan,Mirella Lapata, and Hannaneh Hajishirzi. 2019.Text generation from knowledge graphs with graphtransformers. ArXiv, abs/1904.02342.

Chin-Yew Lin. 2004. Rouge: A package for auto-matic evaluation of summaries. Text SummarizationBranches Out.

Peter J. Liu, Mohammad Saleh, Etienne Pot, BenGoodrich, Ryan Sepassi, Lukasz Kaiser, and NoamShazeer. 2018. Generating wikipedia by summariz-ing long sequences. ArXiv, abs/1801.10198.

Edward Loper and Steven Bird. 2002. Nltk: The natu-ral language toolkit. In Proceedings of the ACL-02Workshop on Effective Tools and Methodologies forTeaching Natural Language Processing and Com-putational Linguistics - Volume 1, ETMTNLP ’02,pages 63–70, Stroudsburg, PA, USA. Associationfor Computational Linguistics.

Shashi Narayan, Shay B Cohen, and Mirella Lapata.2018a. Don’t give me the details, just the summary!topic-aware convolutional neural networks for ex-treme summarization. In EMNLP.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata.2018b. Don’t give me the details, just the summary!topic-aware convolutional neural networks for ex-treme summarization. ArXiv, abs/1808.08745.

Ani Nenkova and Kathleen McKeown. 2012. A surveyof text summarization techniques. In Mining TextData.

Nikola I. Nikolov, Michael Pfeiffer, and Richard H. R.Hahnloser. 2018. Data-driven summarization of sci-entific articles. ArXiv, abs/1804.08875.

Dragomir R. Radev, Pradeep Muthukrishnan, VahedQazvinian, and Amjad Abu-Jbara. 2009. The acl an-thology network corpus. Language Resources andEvaluation, 47:919–944.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. 2019. Languagemodels are unsupervised multitask learners.

Alexander M. Rush, Sumit Chopra, and Jason Weston.2015. A neural attention model for abstractive sen-tence summarization. ArXiv, abs/1509.00685.

Natalie Schluter. 2017. The limits of automatic sum-marisation according to rouge. In EACL.

Abigail See, Peter J. Liu, and Christopher D. Manning.2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.

Simeng Sun, Ori Shapira, Ido Dagan, and AniNenkova. 2019. How to compare summarizerswithout target length ? pitfalls , solutions and re-examination of the neural summarization literature.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.Sequence to sequence learning with neural net-works. In Advances in neural information process-ing systems, pages 3104–3112.

https://doi.org/10.3115/1118108.1118117

https://doi.org/10.3115/1118108.1118117

https://doi.org/10.18653/v1/P17-1099

https://doi.org/10.18653/v1/P17-1099

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in neural information pro-cessing systems, pages 5998–6008.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q.Weinberger, and Yoav Artzi. 2019a. Bertscore:Evaluating text generation with bert. ArXiv,abs/1904.09675.

Xingxing Zhang, Furu Wei, and Ming de Zhou. 2019b.Hibert: Document level pre-training of hierarchicalbidirectional transformers for document summariza-tion. ArXiv, abs/1905.06566.

A Appendices

A.1 Additional baselines forSASS benchmarks

ConvS2S For this model, we use the convolu-tional seq2seq architecture proposed by (Gehringet al., 2017). A given source text x = (x1, ..., xn)and the positions of tokens within that sourcetext p = (p1, ..., pn) are first encoded using em-bedding matrices. The embedded representationsare passed to a 20-layer convolutional encoderwith a kernel width of 3 and hidden size of 512.For the convolutional decoder, we use the samehyper-parameters. During decoding, we penal-ize generation of UNK characters and use diversebeam search to handle commonly occurring long-sequence generation errors.RNNExt The baseline we use for reinforcementlearning is from the work of (Chen and Bansal,2018). This model uses a hybrid extractive andabstractive summarization approach. First, eachsentence in the source text is encoded using a tem-poral convolutional model, then a bi-directionalLSTM is used to compute a context-aware repre-sentation. A set of salient sentences is extractedfrom among the encoded sentences using a LSTMpointer network. The seq2seq abstractor rewritesthese sentences through compression and para-phrasing. The whole model is trained using theREINFORCE algorithm by formulating a MarkovDecision Process where the extractor is a RL agentthat receives a reward each time the abstractor fin-ishes summarization of an extracted sentence.7

A.2 Additional SASS Tasks

We report summarization baseline results on thefull SASS dataset for the Abstract → Title, In-troduction → Title and Introduction → Abstracttasks.

Model Rouge-1 Rouge-2 Rouge-L

Last-1 0.1780 0.0523 0.1457Lede-1 0.3015 0.1409 0.2513LSTM 0.4044 0.2164 0.3641LSTM + PG (See et al., 2017) 0.4053 0.2180 0.3653ConvS2S (Gehring et al., 2017) 0.3421 0.1976 0.3207

Table 9: Abstract to Title

7We only use this baseline for the intro-to-abstract taskdue to the fact the convolutional encoder only accommodatesmulti-sentence summaries.


Last-3 0.2789 0.0728 0.1687Lede-3 0.2790 0.0740 0.1704LSTM 0.2436 0.0538 0.2192LSTM + PG (See et al., 2017) 0.3365 0.1083 0.2933ConvS2S (Gehring et al., 2017) 0.2857 0.0965 0.2451RNNExt (Chen and Bansal, 2018) 0.4000 0.1253 0.3683

Table 11: Intro to Abstract Task


Last-1 0.1147 0.0219 0.0959Lede-1 0.1649 0.0447 0.1339LSTM 0.1853 0.0672 0.1680LSTM + PG (See et al., 2017) 0.3189 0.1444 0.2886ConvS2S (Gehring et al., 2017) 0.3226 0.1794 0.3032

Table 10: Intro to Title

A.3 Impact of Discriminator

Model Avg # Words % New Decision

Base (topk,k=4) 49.11 —Co-opNet (topk, k=4) 86.40 97.6

Table 13: Analysis of how often the discriminatorchanges the sequence selected as the top candidate.

Model Task BERT-P BERT-R BERT-F1 SCIBERT-P SCIBERT-R SCIBERT-F1

PGen + Coverage (See et al., 2017) First-50 57.17 55.79 56.30 60.23 59.31 59.70Base (topk,k=4) First-50 59.16 54.72 56.72 62.97 59.42 61.05Co-opNet (topk, k=4) First-50 57.76 55.95 56.74 61.57 60.28 60.86

PGen + Coverage (See et al., 2017) Last-50 53.55 52.45 52.90 57.29 56.52 56.84Base (topk,k=4) Last-50 55.58 51.76 53.48 59.96 56.73 58.23Co-opNet (topk, k=4) Last-50 53.95 52.29 53.04 58.13 56.89 57.45

PGen + Coverage (See et al., 2017) Overall 60.44 55.88 57.86 61.15 57.53 59.13Base (topk,k=4) Overall 63.17 52.25 57.00 65.27 56.08 60.20Co-opNet (topk, k=4) Overall 60.66 56.23 58.21 62.03 58.51 60.11

Table 12: BERTScore results for AAN subset

Discriminator Networks Co-opNet arXiv:1907.01272v1 [cs.CL ... · Discriminator Networks (Co-opNet...

Documents

Transcript of Discriminator Networks Co-opNet arXiv:1907.01272v1 [cs.CL ... · Discriminator Networks (Co-opNet...