VWRSSHG - arxiv.org

13
Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward Luyang Huang 1 Lingfei Wu 2 and Lu Wang 1 1 Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115 2 IBM Research AI, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 1 [email protected], [email protected] 2 [email protected] Abstract Sequence-to-sequence models for abstractive summarization have been studied extensively, yet the generated summaries commonly suffer from fabricated content, and are often found to be near-extractive. We argue that, to address these issues, the summarizer should acquire se- mantic interpretation over input, e.g., via struc- tured representation, to allow the generation of more informative summaries. In this pa- per, we present ASGARD, a novel framework for Abstractive Summarization with Graph- Augmentation and semantic-driven RewarD. We propose the use of dual encoders—a sequential document encoder and a graph- structured encoder—to maintain the global context and local characteristics of entities, complementing each other. We further design a reward based on a multiple choice cloze test to drive the model to better capture entity in- teractions. Results show that our models pro- duce significantly higher ROUGE scores than a variant without knowledge graph as input on both New York Times and CNN/Daily Mail datasets. We also obtain better or comparable performance compared to systems that are fine- tuned from large pretrained language models. Human judges further rate our model outputs as more informative and containing fewer un- faithful errors. 1 Introduction Abstractive summarization aims to produce con- cise and informative summaries with the goal of promoting efficient information consumption and knowledge acquisition (Luhn, 1958). Signif- icant progress has been made in this area by de- signing sequence-to-sequence-based neural mod- els for single-document abstractive summariza- tion (Gehrmann et al., 2018; Liu et al., 2018; Liu and Lapata, 2019). However, due to the limita- tions of model structure and word prediction-based Input Article of New York Times: John M. Fabrizi, the mayor of Bridgeport, admitted on Tuesday that he had used cocaine and abused alcohol while in office. Mr. Fabrizi, who was appointed mayor in 2003 after the former mayor, Joseph P. Ganim, went to prison on corruption charges, said he had sought help for his drug problem about 18 months ago and that he had not used drugs since. About four months ago, he added, he stopped drinking alcohol. Constructed Graph: Summary by Human: The Week column. Mayor John Fabrizi of Brigeport, Conn, publicly admits he used cocaine and abused alcohol while in office; says he stopped drinking alcohol and sought help for his drug problem about 18 months ago. cocaine drinking alcohol alcohol John M. Fabrizi, he, ... had used abused stopped Figure 1: Sample knowledge graph constructed from an article snippet. The graph localizes relevant informa- tion for entities (color coded, e.g. “John M. Fabrizi”) or events (underlined) and provides global context. learning objectives, these models frequently pro- duce unfaithful content (Cao et al., 2018) and near- extractive summaries (See et al., 2017; Kry´ sci ´ nski et al., 2018). These observations suggest that ex- isting models lack semantic interpretation over the input, which is critical for summarization. We argue that the generation of informative and succinct abstracts requires structured representa- tion to facilitate the connection of relevant subjects, and the preservation of global context, e.g. entity interactions and topic flows. Take Fig. 1 as an ex- arXiv:2005.01159v1 [cs.CL] 3 May 2020

Transcript of VWRSSHG - arxiv.org

Page 1: VWRSSHG - arxiv.org

Knowledge Graph-Augmented Abstractive Summarizationwith Semantic-Driven Cloze Reward

Luyang Huang1 Lingfei Wu2 and Lu Wang1

1Khoury College of Computer Sciences, Northeastern University, Boston, MA 021152IBM Research AI, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

[email protected], [email protected]@us.ibm.com

Abstract

Sequence-to-sequence models for abstractivesummarization have been studied extensively,yet the generated summaries commonly sufferfrom fabricated content, and are often foundto be near-extractive. We argue that, to addressthese issues, the summarizer should acquire se-mantic interpretation over input, e.g., via struc-tured representation, to allow the generationof more informative summaries. In this pa-per, we present ASGARD, a novel frameworkfor Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD.We propose the use of dual encoders—asequential document encoder and a graph-structured encoder—to maintain the globalcontext and local characteristics of entities,complementing each other. We further designa reward based on a multiple choice cloze testto drive the model to better capture entity in-teractions. Results show that our models pro-duce significantly higher ROUGE scores thana variant without knowledge graph as input onboth New York Times and CNN/Daily Maildatasets. We also obtain better or comparableperformance compared to systems that are fine-tuned from large pretrained language models.Human judges further rate our model outputsas more informative and containing fewer un-faithful errors.

1 Introduction

Abstractive summarization aims to produce con-cise and informative summaries with the goalof promoting efficient information consumptionand knowledge acquisition (Luhn, 1958). Signif-icant progress has been made in this area by de-signing sequence-to-sequence-based neural mod-els for single-document abstractive summariza-tion (Gehrmann et al., 2018; Liu et al., 2018; Liuand Lapata, 2019). However, due to the limita-tions of model structure and word prediction-based

Input Article of New York Times:John M. Fabrizi, the mayor of Bridgeport, admitted on Tuesday that he had used cocaine and abused alcohol while in office.Mr. Fabrizi, who was appointed mayor in 2003 after the former mayor, Joseph P. Ganim, went to prison on corruption charges, said he had sought help for his drug problem about 18 months ago and that he had not used drugs since. About four months ago, he added, he stopped drinking alcohol.

Constructed Graph:

Summary by Human:The Week column. Mayor John Fabrizi of Brigeport, Conn, publicly admits he used cocaine and abused alcohol while in office; says he stopped drinking alcohol and sought help for his drug problem about 18 months ago.

cocaine

drinking alcohol

alcoholJohn M. Fabrizi,

he, ...

had used

abused

stopped

Figure 1: Sample knowledge graph constructed froman article snippet. The graph localizes relevant informa-tion for entities (color coded, e.g. “John M. Fabrizi”)or events (underlined) and provides global context.

learning objectives, these models frequently pro-duce unfaithful content (Cao et al., 2018) and near-extractive summaries (See et al., 2017; Kryscinskiet al., 2018). These observations suggest that ex-isting models lack semantic interpretation over theinput, which is critical for summarization.

We argue that the generation of informative andsuccinct abstracts requires structured representa-tion to facilitate the connection of relevant subjects,and the preservation of global context, e.g. entityinteractions and topic flows. Take Fig. 1 as an ex-

arX

iv:2

005.

0115

9v1

[cs

.CL

] 3

May

202

0

Page 2: VWRSSHG - arxiv.org

ample. Complex events related with the same entitymay span multiple sentences, making it challeng-ing for existing sequential models to capture. Agraph representation, on the contrary, produces astructured summary and highlights the proximityof relevant concepts.

To this end, we present ASGARD, a frame-work for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD.1 Un-der the encoder-decoder framework, we enhancethe regular document encoder with a separategraph-structured encoder to maintain the globalcontext and local characteristics of entities by us-ing the outputs from an open information extraction(OpenIE) system.

Specifically, we experiment with two graph vari-ants, one mainly capturing entities’ document-levelinteractions and the other reflecting such interac-tions within each paragraph plus topic shifts acrossparagraphs. Both graphs can capture interactionsamong entities that are positioned far from oneanother in the document and significantly reduceredundancy, as shown in Fig. 1. The document en-coder and the graph encoder then cooperate duringabstract generation, wherein the model is trained toidentify salient content by aligning graphs with hu-man summaries. Though structured representationhas been studied before for summarization (Fer-nandes et al., 2019), to the best of our knowledge,we are the first to utilize graph neural networks toexplicitly encode entity-centered information forabstractive summary generation.

Moreover, we propose a novel multi-choice clozereward to drive the model to acquire semantic un-derstanding over the input. Concretely, we de-sign cloze questions by removing pairwise entitiesthat are connected with a predicate or co-occur ina human summary sentence, whereas prior workonly considers single entities to construct ques-tions (Eyal et al., 2019). In tandem with our graphencoding of knowledge, the cloze reward further fa-cilitates the acquisition of global entity interactionswith reinforcement learning.

We carry out automatic and human evaluationson popular summarization datasets. Models basedon ASGARD yield significantly better ROUGEscores (Lin and Hovy, 2003) than a variant with-out access to the knowledge graph on two popularnews summarization datasets, New York Times

1Our code is available at https://github.com/luyang-huang96/GraphAugmentedSum.

corpus and CNN/Daily Mail dataset. Moreover,ASGARD models attain performance better thanor comparable to others that are fine-tuned fromlarge pretrained language models, including BERT-Sum (Liu and Lapata, 2019), UniLM (Dong et al.,2019), and BART (Lewis et al., 2019). Humanjudges further confirm that our models generatemore informative summaries with less unfaithfulerrors than their counterparts without the graphencoder. Importantly, we find that automatic eval-uation metrics only weakly correlate with theseerrors, implying that new evaluation methods areneeded to better gauge summary quality.

The rest of the paper is organized as follows. Wedescribe related work in the next section (§ 2). Wethen discuss the knowledge graph construction in§ 3 and formulate our graph-augmented summa-rization framework in § 4. In § 5, we introducereinforcement learning with cloze reward. Exper-iments and results are presented in § 6 and § 7.Finally, we conclude in § 8.

2 Related Work

Graph-Augmented Summarization and Gener-ation. Graph structures have long been used forextractive summarization, such as in Textrank (Mi-halcea and Tarau, 2004) and Lexrank (Erkan andRadev, 2004). For neural models, Tan et al. (2017)design graph-based attention to identify importantsentences. For generating abstractive summaries,Fernandes et al. (2019) enhance a sequence-basedencoder with graph neural networks (GNNs) to con-sider token-level entity types, however, entity in-teractions are largely ignored. On multi-documentsummarization, Fan et al. (2019) demonstrate theusefulness of encoding a linearized knowledgegraph from OpenIE outputs. In this work, wedesign a graph encoder, which improves uponGraph Attention Networks (GATs) (Velickovicet al., 2018), to capture the global context in amore effective manner.

Also related is the graph-to-sequence frameworkthat has been adopted for text generation (Songet al., 2018). Both Gated Graph Neural Networks(GGNNs) (Beck et al., 2018) and Graph Convo-lutional Networks (GCNs) (Damonte and Cohen,2019) are shown to be effective in generating sen-tences from AMR graphs. Since Graph AttentionNetworks can better handle sparse graphs, theyare used by Koncel-Kedziorski et al. (2019) witha transformer model to create scientific paper ab-

Page 3: VWRSSHG - arxiv.org

stracts from knowledge graphs. Here we use graphsin addition to document encoder, both carryingcomplementary information for summarization.

Reinforcement Learning and QA Reward forAbstractive Summarization. As pointed outby Ranzato et al. (2016), word-level maximumlikelihood training brings the problem of exposurebias. Recent work utilizes reinforcement learningto directly optimize the model to maximize theinformativeness of summaries by using differentforms of ROUGE scores (Paulus et al., 2018; Chenand Bansal, 2018; Sharma et al., 2019). However,ROUGE does not always distinguish good sum-maries from bad ones (Novikova et al., 2017), andignores entity interactions.

Since question answering (QA) has been usedfor summary evaluation (Narayan et al., 2018), andis shown to correlate with human judgment of sum-maries qualities (Eyal et al., 2019), QA-based re-wards have been studied for summarization modeltraining. Arumae and Liu (2019) demonstrate thatusing fill-in-the-blank questions by removing enti-ties or root words leads to improved content selec-tion. Scialom et al. (2019) consider a similar setup,but use both F1 score and QA system confidenceas rewards in abstractive summarization. Previouswork, however, mainly focuses on single entities orwords in human-written summaries, thereby losingcontexts and relations. Moreover, fill-in-the-blankquestions by prior work give credits only whenthe answers exactly match the ground-truths, thuscausing inaccuracies for rephrased answers and dis-couraging abstract content generation. In contrast,we design a semantic-driven cloze reward by mea-suring how well a QA system can address multiplechoice cloze questions which better encode entityinteractions and handle paraphrased answers.

3 Knowledge Graph Construction

To construct a knowledge graph from an input doc-ument, we utilize Stanford CoreNLP (Manninget al., 2014) to first obtain outputs from corefer-ence resolution and open information extraction(OpenIE) models (Angeli et al., 2015). Note thatwe do not conduct global entity linking across doc-uments. Next, we take the 〈subject, predicate,object〉 triples extracted by OpenIE and removeany triple whose argument (subject or object) hasmore than 10 words. If two triples differ only byone argument, and the arguments overlap, we keepthe longer triple.

Generated Summary

Input ArticleMayor 's Admission of Cocaine Use ….

RoBERTa Layers

Bi-LSTM Layer GAT Layers

Attention Layer

The columnWeek

...

Attention Layer

OpenIE

<SOS>

CtCtv

Node Initialization

Figure 2: Our ASGARD framework with document-level graph encoding. Summary is generated by attend-ing to both the graph and the input document.

We begin constructing the graph by treating sub-jects and objects as nodes connected by directededges, with predicates as attributes. We further col-lapse coreferential mentions of the same entity intoone node. With this, we can localize salient contentrelated to each entity as well as make connectionsof spread-out entities through graph paths.

4 Summarization Model

In this section, we describe our graph-augmentedabstractive summarization framework, as displayedin Fig. 2. Our model takes as input a document,represented as a sequence of tokens x = {xk}, anda knowledge graph G consisting of nodes {vi}. xand G are separately consumed by a document en-coder and a graph encoder, as presented in § 4.1.Importantly, we present two types of graphs: DOC-GRAPH, focusing on the global context, and SEG-GRAPH, which additionally captures topic shift.The summary decoder then generates an abstrac-tive summary by attending to both the documentand the graph (§ 4.2). In § 4.3, we formulate a max-imum likelihood training objective which leveragesthe detection of salient nodes in the graph.

4.1 Encoders

Document Encoder. We first feed input x toRoBERTa (Liu et al., 2019) and take the last layer

Page 4: VWRSSHG - arxiv.org

output as token embeddings. We then employ asingle-layer bidirectional LSTM (BiLSTM) over to-ken embeddings, producing encoder hidden stateshk at time step k.

Graph Encoder. Built on the graph constructed in§ 3, we create nodes for predicates as done in pre-vious graph-to-sequence work (Beck et al., 2018)to reduce model parameters. Directed, unlabelededges are added from subject to predicate, and frompredicate to object. We further add reverse edgesand self-loops to enhance the information flow, andthis forms the graph G.

Node Initialization. Each node often contains mul-tiple mentions of an entity; we thus initialize noderepresentation vi by using the average embeddingof its tokens. We leverage document encoder hid-den states hk as the contextual representation oftokens. Number of mentions in the node is added asan extra encoding to vi, to signify entity salience.

Contextualized Node Encoding. Our graph en-coder improves upon Graph Attention Networks(GATs) (Velickovic et al., 2018) by adding residualconnections between layers as discussed in Koncel-Kedziorski et al. (2019). Each node vi is repre-sented by a weighted average of its neighbors:

vi = vi + ‖Nn=1

∑vj∈N (vi)

αni,jW0,nvj (1)

αni,j = softmax((W1,nvi)

T (W2,nvj)) (2)

where ‖Nn=1 denotes the concatenation of N heads,each producing a vector of the same dimension asvi. We use N = 4 in our experiments with twolayers of GATs. N (vi) denotes the neighbors of viin graph G. W∗ are trainable parameters.

The graph encoder described above encodesdocument-level global context by merging entitymentions throughout the document and capturingtheir interactions with graph paths. It is henceforthdenoted as DOCGRAGH.

Encoder Extension to Capture Topic Shift(SEGGRAGH). Modeling topic transitions and re-currences enables the identification of notable con-tent, thus benefiting summarization (Barzilay andLee, 2004). Since paragraphs naturally divide adocument into different topic segments, we extendDocGragh by first encoding each paragraph as asubgraph Gp (for the p-th paragraph) using thesame graph encoder, and then connecting all sub-graphs with a BiLSTM. If two nodes in separatesubgraphs refer to the same entity, they are initial-

ized with the same embedding (as in the first oc-currence). Concretely, we first apply max-poolingover all nodes in subgraph Gp from the outputs ofthe final GAT layer; the max-pooling results arethen used as inputs for a BiLSTM to produce thefinal subgraph representation hg

p for Gp.

4.2 Summary DecoderOur summary decoder uses a single-layer unidi-rectional LSTM with a hidden state st at step t; itgenerates summary tokens recurrently by jointlyattending to the input document and the graph.Attending the Graph. At each decoding step t,we compute a graph context vector cvt with theattention mechanism (Bahdanau et al., 2014):

cvt =∑i

avi,tvi (3)

avi,t = softmax(uT0 tanh(W3st +W4vi)) (4)

where u∗ are also trainable parameters. We omitbias terms for simplicity.Attending the Document. Similarly, the docu-ment context ct is computed over input tokens byadditionally considering the graph context cvt :

ct =∑k

ak,thk (5)

ak,t = softmax(

uT1 tanh(W5st +W6hk +W7c

vt )) (6)

Token Prediction. Graph and document contextvectors, treated as salient content summarized fromboth sources, are concatenated with the decoderhidden state st to produce the vocabulary distribu-tion Pvocab:

Pvocab = softmax(Wout[st|ct|cvt ]) (7)

We use weight-sharing between the input embed-ding matrix and the matrix Wout to allow reusinglinguistic knowledge as proposed by Paulus et al.(2018). We further add a copy mechanism similarto See et al. (2017), with copy probability as:

Pcopy = σ(Wcopy[st|ct|cvt |yt−1]) (8)

where yt−1 denotes the embedding for the tokenpredicted at step t− 1.Modified Hierarchical Attention for SegGraph.As mentioned in § 4.1, SegGraph captures contentsalience by modeling topic shift across paragraphs.We thus seek to leverage paragraph-level impor-tance to redistribute the node attentions, e.g., giving

Page 5: VWRSSHG - arxiv.org

more attentions to nodes in important paragraphs.In particular, we utilize hierarchical attention (Hsuet al., 2018), where we first calculate attention agtover subgraphs as done in Eq. 3 by replacing vi

with subgraph representation hgp.

We then combine subgraph attentions agt withthe previously calculated attentions avt for nodes inthe subgraph using scalar multiplication and renor-malization over all nodes in input. This results inthe new attention weights avt , which are used toobtain graph context vector cvt as done in Eq. 3 forSegGraph.

4.3 Training ObjectivesWe first consider a maximum likelihood (ML) train-ing objective that minimizes the following loss:

Lseq = − 1

|D|∑

(y,x)∈D

log p(y |x; θ) (9)

where x are documents and y are references fromthe training set D, and θ are model parameters.

Node Salience Labeling. In addition to model-ing local characteristics of nodes, we further en-hance the model by adding an objective to labelnode salience, e.g., whether the entities in a nodeare mentioned in the reference summaries. We in-troduce a soft mask layer over each node beforeit is passed into the graph encoder, to signify itssalience. This layer, serving as an information gate,predicts a real number mi in [0, 1] for each node vi

and multiplies to itself, i.e. mivi. For node vi, themask is calculated as mi = sigmoid(u2vi). Dur-ing training, the gold-standard mask mi for a nodeis set to 1 if it contains at least one content word inthe reference summary; otherwise, 0. We add thefollowing objective for all nodes in the dataset D:

Lmask = − 1

Nv

∑vi∈D

mi log(mi)+

(1−mi) log(1− mi) (10)

where Nv represents the number of nodes in thedataset. Finally, the ML training objective takesthe following form: Lml = Lmask + Lseq.

5 Reinforcement Learning with Cloze

After maximum likelihood training with Lml, wefurther design a multiple choice cloze reward in asecond-stage reinforcement learning (RL), leadingthe model to generate more faithful and informativesummaries.

For RL, we use a self-critical policy gradientalgorithm (Rennie et al., 2017). During training,two summaries are generated: first, a summary ys,sampling tokens based on the probability distribu-tion p(ys|x; θ) at each decoding step; and second,a baseline summary y which greedily selects thetokens of the highest probability at each step. Theobjective of RL is defined based on the rewards ofthe two summaries, R(ys) and R(y), as follows:

Lrl =

− 1

|D|∑

(ys,x)∈D

(R(ys)−R(y)) log p(ys|x; θ)

(11)

Our reward function uses the combinationof ROUGE and the multiple choice cloze scoreintroduced below, i.e., R(y) = Rrouge(y) +γclozeRcloze. For ROUGE, it considers F1 scoresof ROUGE-1, ROUGE-2, and ROUGE-L calcu-lated against the reference summary, and takesthe form of Rrouge(y) = γ1Rrouge−1(y) +γ2Rrouge−2(y) + (1− γ1 − γ2)Rrouge−L(y).Multiple Choice Cloze Reward. Here, wepresent a novel multiple choice cloze reward towork with our knowledge graph and guide thesummarization model towards improved aware-ness of entity interactions. We treat the system-generated summary as context. We provide a setof questions automatically constructed from thecorresponding reference summary written by a hu-man. We separately train a question answering(QA) model to address the questions by readingthe context. Intuitively, if the system summaryshares salient information with the reference, theQA model will assign the correct answers withhigh probability. We decide to use the averageprobability of the correct answers as our cloze re-ward. Below, we give details on how to constructthe questions and candidate answers with examplesshown in Fig. 3.Question Construction. We run the OpenIE tool onhuman-written summaries, retaining triples witharguments not longer than 5 words. For each tripleof 〈subject, predicate, object〉, we create two typesof questions: (1) argument pair questions, by re-moving the subject and object, and (2) predicatequestions, by removing the predicate.Candidate Answer Construction. Because fill-in-the-blank style cloze may incorrectly penalize QAsystems with answers paraphrased from the ground-truth, we opt for a multiple choice cloze. We con-struct three candidate answers in addition to the

Page 6: VWRSSHG - arxiv.org

Reference Summary:Federal Reserve increases interest rates.

IE Output:〈 Federal Reserve, increases, interest rates 〉

Salient Context:Federal Reserve signals positivity about the market. Fedincreases benchmark interest rate again this May. Americaneconomy keeps the high growth rate. Jerome H. Powelldiscussed potential risks.

IE Outputs:1. 〈 Federal Reserve, signals, positivity 〉2. 〈 American economy, keeps, the high growth rate 〉3. 〈 Jerome H. Powell, discussed, potential risks 〉

⇓Multiple Choice Cloze Questions:

Argument Pair Question: increases .A. Federal Reserve, interest rates (D)B. interest rates, Federal Reserve (swapping args in A)C. American economy, interest rates (replacing arg us-

ing triple 2)D. Federal Reserve, potential risks (replacing arg using

triple 3)

Predicate Question: Federal Reserve interest rates.A. increases (D) B. signals C. keeps D. discussed

Figure 3: Sample construction of multiple choice clozequestions and candidate answers from reference sum-mary and salient context. Arguments and predicates incandidate answers are color-coded and italicized.

gold-standard from the salient context, which aresummary-worthy sentences selected from the input.Specifically, we use greedy search to select the bestcombination of sentences that maximizes ROUGE-2 F1 with reference to human summary. We furtherinclude a sentence in the salient context if it has aROUGE-L recall greater than 0.6 when comparedwith any sentence in the reference.

We first select OpenIE triples from the salientcontext and filter out those that have any overlap-ping content word with the correct answer. Forargument pair questions, we create one candidateanswer by swapping the subject and the object (e.g.candidate B as in Fig. 3) and two candidates byreplacing the subject or the object with another ar-gument of the same role extracted from the salientcontext (e.g. candidates C and D). If not enoughanswers are created, we further consider randomlyselecting sentences from the input. For predicatequestions, we use predicates in other triples fromthe context as candidate answers. Among all candi-dates, we select the three that are able to constructthe most fluent questions using perplexity predictedby BERT (Devlin et al., 2019).

In case reference summaries do not yield Ope-nIE triples, we create additional entity pair ques-tions. We remove two co-occurring entities fromthe summary and create three candidate answers inthe same way as described above.QA Model. We fine-tune RoBERTa (Liu et al.,2019) to build our QA model. We use the salientcontext described above as the context for training.We then concatenate the context, the question, andeach of the four candidate answers, and pass the fi-nal [CLS] representation through a fully-connectedlayer, from which the answer is predicted.

6 Experimental Setups

Datasets. We experiment with two popular summa-rization datasets with summaries containing multi-ple sentences: the New York Times annotated cor-pus (NYT) (Sandhaus, 2008) and the CNN/DailyMail dataset (CNN/DM) (Hermann et al., 2015).We follow the preprocessing steps and experimen-tal setups from prior work (Paulus et al., 2018;See et al., 2017) for both datasets. For NYT, thetraining, validation, and test sets contain 588, 909,32, 716, and 32, 703 samples. For CNN/DM, thenumbers are 287, 188, 13, 367, and 11, 490.

To train our cloze QA model for NYT, weconstruct 1, 414, 336 question-answer pairs fromhuman-written summaries in the training set basedon the method described in § 5. On CNN/DM, wecollect 1, 361, 175 question-answer samples fromthe training set. For both datasets, we set aside20, 000 samples as a validation set and 20, 000samples as a test set. Our QA model achieves anaccuracy of 97% on NYT and 95% on CNN.Training Details and Parameters. We use thebase version of RoBERTa model to extract tokenfeatures for all experiments. We truncate input ar-ticles to 1024 (NYT) and 512 (CNN/DM) BPEs.We employ LSTM models with 256-dimensionalhidden states for the document encoder (128 eachdirection) and the decoder. For the residual con-nection of the graph encoder, we use 4 heads, eachwith a dimension of 72. For DocGraph trainingand inference, we prune isolated graphs with fewerthan three nodes to increase robustness and reduceredundancy. We set γ1 = 0, γ2 = 0.75 on NYTand γ1 = 0.33, γ2 = 0.33 on CNN/DM after tun-ing on the validation set. For both datasets, we setγcloze = 0.05. More details about parameters andgraph statistics are in the Appendices.Baselines and Comparisons. For both datasets,

Page 7: VWRSSHG - arxiv.org

System ROUGE-1 ROUGE-2 ROUGE-L

LEAD-3 32.59 16.49 29.17POINTGEN+COV 41.06 25.71 37.28DEEPREINFORCE 47.03 30.72 43.10BOTTOMUP 47.38 31.23 41.81DCA 48.08 31.19 42.33SENECA 47.94 31.77 44.34BART 53.25 36.61 48.78Our ModelsNOGRAPH 47.15 32.02 43.65+Rrouge 49.17 33.19 46.44

ASGARD-DOC 49.51 33.82 45.72+Rrouge 50.18 33.91 46.84+Rrouge +Rcloze 50.59 33.98 48.24

ASGARD-SEG 49.54 33.84 45.75+Rrouge 50.47 33.95 47.43+Rrouge +Rcloze 51.29 34.97 48.26

Table 1: Automatic evaluation with ROUGE on NewYork Times. Best results are in boldface. Best of ourmodels are in italics. ASGARD-SEG+Rrouge+Rcloze

yields significantly higher scores than our other modelswith approximate randomization test (p < 0.0005).

we include an extractive baseline LEAD-3. Wefurther add the following abstractive models forcomparison: (1) a pointer-generator model withcoverage (See et al., 2017) (POINTGEN+COV); (2)a deep reinforcement learning-based model (Pauluset al., 2018) (DEEPREINFORCE); (3) a bottom-upmodel (Gehrmann et al., 2018) (BOTTOMUP); (4) adeep communicating agents-based summarizationmodel (Celikyilmaz et al., 2018) (DCA). We alsoreport results by fine-tuning BART model (Lewiset al., 2019). In Lewis et al. (2019), fine-tuning isonly performed on CNN/Daily Mail. We apply thesame method for NYT.

For NYT, we add results by SENECAmodel (Sharma et al., 2019) from our prior work,which previously achieved the best ROUGE-2.

On CNN/Daily Mail, we include comparisonsof a two-stage fine-tuned model (first on an extrac-tor, then on an abstractor) with BERT (Liu andLapata, 2019) (BERTSUMEXTABS), and a unifiedpretrained language model for generation (Donget al., 2019) (UNILM).

In addition to ASGARD-DOC and ASGARD-SEG, which are trained with an ML objective,we report results trained with ROUGE as the re-ward (Rrouge), and with an additional cloze reward(Rcloze). Lastly, we consider a variant NOGRAPH

by ablating the graph encoder.

System ROUGE-1 ROUGE-2 ROUGE-L

LEAD-3 40.23 17.52 36.34POINTGEN+COV 39.53 17.28 36.38DEEPREINFORCE 41.16 15.75 39.08BOTTOMUP 41.22 18.68 38.34DCA 41.69 19.47 37.92BERTSUMEXTABS 42.13 19.60 39.18UNILM 43.33 20.21 40.51BART 44.16 21.28 40.90Our ModelsNOGRAPH 39.55 17.89 36.75+Rrouge 41.37 17.63 37.99

ASGARD-DOC 40.38 18.40 37.51+Rrouge 43.10 17.58 39.41+Rrouge +Rcloze 43.93 20.37 40.48

ASGARD-SEG 40.09 18.30 37.30+Rrouge 42.94 17.93 39.36+Rrouge +Rcloze 43.81 20.22 40.37

Table 2: Automatic evaluation with ROUGE onCNN/Daily Mail. Best results of our model variants arein italics. Both ASGARD-SEG+Rrouge+Rcloze andASGARD-DOC+Rrouge+Rcloze obtain significantlybetter scores than other model variants (p < 0.0005).

7 Results

7.1 Automatic Evaluation

Results on NYT. As displayed in Table 1, ourASGARD-SEG model trained with ROUGE andcloze rewards achieves better ROUGE scores (Linand Hovy, 2003) than all other comparisons exceptthe fine-tuned BART. However, our ASGARD-SEG’s ROUGE-L score is comparable to BART.This indicates the effectiveness of our graph-augmented summarization framework.

Moreover, both our ASGARD-DOC andASGARD-SEG models yield significantly higherROUGE scores than the variant without thegraph encoder (NOGRAPH). This demonstratesthe benefit of using structured representation toencode entity interactions. Furthermore, bothASGARD-DOC and ASGARD-SEG with clozereward (Rcloze) obtain significantly higher scorescompared to the models trained with ROUGE re-ward only. This signifies that our multi-choicecloze reward can guide better semantic interpreta-tion of content, leading to the generation of more in-formative summaries. We also find that ASGARD-SEG outperforms ASGARD-DOC, indicating thatASGARD-SEG better captures topic drift throughmultiple paragraphs.

Results on CNN/DM. We observe similar trendson the CNN/DM articles as shown in Table 2. No-

Page 8: VWRSSHG - arxiv.org

NYT CNN/DM55

60

65

70

75

80

85

90Cl

oze

Scor

e91.1 90.8

68.366.7

72.775.9

71.0

75.7

Probability

NYT CNN/DM70

75

80

85

90

95

100 97.896.6

78.776.1

82.184.2

80.9

83.9

Accuracy

HumanNoGraph+Rrouge

ASGARD-doc+Rrouge+Rcloze

ASGARD-seg+Rrouge+Rcloze

Figure 4: Evaluation with QA model prediction prob-ability and accuracy on our multiple choice cloze test,with higher numbers indicating better summaries.

ticeably, ASGARD-DOC trained with the com-bined ROUGE and cloze reward produces bet-ter ROUGE scores than BERTSUMEXTABS andUNILM, which are carefully fine-tuned from largepretrained language models, and the numbers arealso comparable to the fine-tuned BART.

Evaluation with Cloze Test. We further evalu-ate model-generated summaries with our proposedcloze test. Here, we report two scores in Fig. 4: theaverage probability of the correct answers outputby our QA model, and its prediction accuracy. Wefirst calculate one score per summary, then take theaverage over all summaries. We can see that ourmodels with graph encoders perform better thanthe variant without it.

7.2 Human Evaluation

We further conduct human evaluation to analyzethe informativeness and fluency of the generatedsummaries, as well as to investigate the unfaithfulerrors made by different models. We sample 100articles from the NYT test set and hire three na-tive or fluent speakers of English to rate summariesgenerated by our two systems, NOGRAPH+Rrouge

and ASGARD-SEG+Rrouge +Rcloze, along withoutputs by BART and human-written summaries(presented in random order). After reading the ar-ticles, each judge scores summaries on a Likertscale from 1 (worst) to 5 (best) on informative-ness—whether the summary covers important in-formation from the input, and fluency—whetherthe summary is grammatically correct.

We consider three types of unfaithful errors: (i)hallucination error—creating content not presentin the input, (ii) out-of-context error—generatingfacts without including required context or within

System Inf.↑ Flu.↑ Hal.↓ Out.↓ Del./Sub.↓

HUMAN 4.47 4.65 21% 10% 10%NOGRAPH +Rrouge 3.94 3.65 9%∗ 26% 22%ASGARD-SEG

+Rrouge +Rcloze 4.12† 3.77† 23% 14%† 9%∗

BART 4.44∗ 4.66∗ 16% 15% 12%

Table 3: Human evaluation on informativeness (Inf.)and fluency (Flu.) (1-to-5), and percentages of unfaith-ful errors of hallucination (Hal.), out-of-context (Out.)and deletion or substitution (Del./Sub.). ∗: significantlydifferent from all other models. †: ASGARD-SEG issignificantly better than NOGRAPH (p < 0.05). Inter-rater agreement with Krippendorf’s α for all columns:0.61, 0.70, 0.57, 0.50 and 0.43.

Summary by Human:Family Court in Burlington County, NJ, rules that lesbiancouple can list both their names as parents on birth cer-tificate of newborn; state attorney general’s office dropsopposition to move; court ruling negates couple’s hav-ing to go through adoption proceedings to establish fullparental rights for both.NoGraph+Rrouge:Lesbian couple in South Jersey wins court approval to haveboth of their names listed as parents on birth certificate oftheir newborn. it will no longer oppose such applicationsASGARD-doc+Rrouge +Rcloze:Lesbian couple in South Jersey, won court approval to haveboth of their names listed as parents on birth certificate oftheir newborn. attorney general’s office says it will nolonger oppose such applicationsASGARD-seg+Rrouge +Rcloze:Lesbian couple in South Jersey wins court approval to haveboth of their names listed as parents on birth certificate ofnewborn and attorney general ’s office will no longer op-pose such applications. decision stems from Oct 0 rul-ing by New Jersey Supreme Court holding that same-sex couples are entitled to same legal rights and protec-tions as heterosexual couples

Figure 5: Sample summaries for an NYT article. Sum-maries by our models with the graph encoder are moreinformative than the variant without it.

incorrect context, and (iii) deletion or substitu-tion error—mistakenly deleting or substitutingsubjects, objects, or clauses. We ask the anno-tators to label each type as 1 for existence of errors,and 0 otherwise. Detailed guidelines are in theAppendices.

From Table 3, we can see that our ASGARD-SEG model obtains better scores in informativenessand fluency, compared to the variant without thegraph encoder. This indicates the effectiveness ofleveraging knowledge graph representation. Sam-ple output summaries by our models can be foundin Fig. 5. Meanwhile, fine-tuned BART modelproduces outputs with similar informativeness andfluency of human-constructed summaries, suggest-

Page 9: VWRSSHG - arxiv.org

ing a future direction of building our model on topof a large-pretrained encoder-decoder model.

For unfaithful errors, we report the percentageof errors calculated by majority voting (i.e., morethan one annotator vote as incorrect). First, we findthat our ASGARD-SEG model has a comparableerror pattern as human summaries. Specifically, forout-of-context and deletion or substitution errors,our graph-enhanced model produces significantlyfewer mistakes in these categories, compared to themodel without graph information. This implies thatknowledge graph-enhanced models can improvesummary faithfulness.

Interestingly, human-written summaries are alsodiscerned to contain a nontrivial amount of halluci-nation errors. After inspection, we find that humantends to leverage world knowledge to include con-tent that is not covered by the articles. For instance,for an article discussing events in “Boston”, thehuman writer may describe them as happening in“Massachusetts” in the summary.

7.3 Analyzing Automatic Metrics andSummary Errors

We further plot the distributions of automatic evalu-ation scores regarding the three types of unfaithfulerrors based on majority voting in Fig. 6. First,summaries with out-of-context and deletion or sub-stitution errors receive lower cloze and ROUGEscores overall.

Nevertheless, with regard to hallucination er-rors, we do not see such pattern; there is even aslightly reversed relation with both cloze scores andROUGE scores, wherein summaries with more hal-lucination errors tend to score higher. This echosour previous observation that human summariescan be hallucinatory too, where world knowledgeis used for writing the summaries.2

Furthermore, we find a weak correlation betweenthe three variants of ROUGE scores and three typesof errors, e.g., the minimum and the maximumvalues of Pearson’s r are −0.19 and 0.14. Thissuggests that new metrics should be designed tobetter gauge summary quality. We plan to studythis direction in future work.

2During human evaluation, we do not ask human judges todistinguish the source of hallucination errors, i.e. from worldknowledge or out of fabrication, since this requires significantdomain knowledge.

True False0.0

0.2

0.4

0.6

0.8

1.0Cloze Score

True False

ROUGE-1

True False

ROUGE-2

True False

ROUGE-L

True False0.0

0.2

0.4

0.6

0.8

1.0

True False True False True False

True False0.0

0.2

0.4

0.6

0.8

1.0

True False True False True False

Auto

mat

ic E

valu

atio

n Hallucination

Out-of-Context

Deletion/Substitution

Figure 6: Distribution of automatic summarization met-rics with three types of unfaithful errors. “True” indi-cates summaries with such type of error.

8 Conclusion

We presented a novel knowledge graph-augmentedabstractive summarization framework, along witha novel multiple choice cloze reward for reinforce-ment learning. Our models capture both local char-acteristics and global interactions of entities fromthe input, thus generating summaries of higher qual-ity. In tandem with the graph representation, ourcloze reward further improves summary content.Human evaluation further confirms that our graph-augmented models trained with the cloze rewardproduce more informative summaries and signifi-cantly reduces unfaithful errors.

Acknowledgements

This research is supported in part by National Sci-ence Foundation through Grant IIS-1813341, andby the Office of the Director of National Intelli-gence (ODNI), Intelligence Advanced ResearchProjects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions containedherein are those of the authors and should not be in-terpreted as necessarily representing the officialpolicies, either expressed or implied, of ODNI,IARPA, or the U.S. Government. The U.S. Gov-ernment is authorized to reproduce and distributereprints for governmental purposes notwithstand-ing any copyright annotation therein. We thank theanonymous reviewers for their suggestions.

Page 10: VWRSSHG - arxiv.org

References

Gabor Angeli, Melvin Jose Johnson Premkumar, andChristopher D. Manning. 2015. Leveraging linguis-tic structure for open domain information extraction.In Proceedings of the 53rd Annual Meeting of theAssociation for Computational Linguistics and the7th International Joint Conference on Natural Lan-guage Processing (Volume 1: Long Papers), pages344–354, Beijing, China. Association for Computa-tional Linguistics.

Kristjan Arumae and Fei Liu. 2019. Guiding ex-tractive summarization with question-answering re-wards. In Proceedings of the 2019 Conference ofthe North American Chapter of the Association forComputational Linguistics: Human Language Tech-nologies, Volume 1 (Long and Short Papers), pages2566–2577, Minneapolis, Minnesota. Associationfor Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-gio. 2014. Neural machine translation by jointlylearning to align and translate. In Proceedings ofthe International Conference on Learning Represen-tations (ICLR).

Regina Barzilay and Lillian Lee. 2004. Catching thedrift: Probabilistic content models, with applicationsto generation and summarization. In Proceedings ofthe Human Language Technology Conference of theNorth American Chapter of the Association for Com-putational Linguistics: HLT-NAACL 2004, pages113–120.

Daniel Beck, Gholamreza Haffari, and Trevor Cohn.2018. Graph-to-sequence learning using gatedgraph neural networks. In Proceedings of the56th Annual Meeting of the Association for Com-putational Linguistics (Volume 1: Long Papers),pages 273–283, Melbourne, Australia. Associationfor Computational Linguistics.

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018.Faithful to the original: Fact aware neural abstrac-tive summarization. In Proceedings of the Associ-ation for the Advancement of Artificial Intelligence(AAAI).

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, andYejin Choi. 2018. Deep communicating agents forabstractive summarization. In Proceedings of the2018 Conference of the North American Chapter ofthe Association for Computational Linguistics: Hu-man Language Technologies, Volume 1 (Long Pa-pers), pages 1662–1675, New Orleans, Louisiana.Association for Computational Linguistics.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstrac-tive summarization with reinforce-selected sentencerewriting. In Proceedings of the 56th Annual Meet-ing of the Association for Computational Linguistics(Volume 1: Long Papers), pages 675–686.

Marco Damonte and Shay B Cohen. 2019. Structuralneural encoders for amr-to-text generation. In Pro-ceedings of the 2019 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies, Vol-ume 1 (Long and Short Papers), pages 3649–3658.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xi-aodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou,and Hsiao-Wuen Hon. 2019. Unified languagemodel pre-training for natural language understand-ing and generation. In H. Wallach, H. Larochelle,A. Beygelzimer, F. d'Alche-Buc, E. Fox, and R. Gar-nett, editors, Advances in Neural Information Pro-cessing Systems 32, pages 13063–13075. Curran As-sociates, Inc.

Gunes Erkan and Dragomir R Radev. 2004. Lexrank:Graph-based lexical centrality as salience in textsummarization. Journal of artificial intelligence re-search, 22:457–479.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019.Question answering as an automatic evaluation met-ric for news article summarization. In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, Volume 1(Long and Short Papers), pages 3938–3948, Min-neapolis, Minnesota. Association for ComputationalLinguistics.

Angela Fan, Claire Gardent, Chloe Braud, and An-toine Bordes. 2019. Using local knowledge graphconstruction to scale Seq2Seq models to multi-document inputs. In Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP), pages 4177–4187, Hong Kong, China. As-sociation for Computational Linguistics.

Patrick Fernandes, Miltiadis Allamanis, and MarcBrockschmidt. 2019. Structured neural summariza-tion. In International Conference on Learning Rep-resentations.

Sebastian Gehrmann, Yuntian Deng, and AlexanderRush. 2018. Bottom-up abstractive summarization.In Proceedings of the 2018 Conference on Em-pirical Methods in Natural Language Processing,pages 4098–4109, Brussels, Belgium. Associationfor Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,

Page 11: VWRSSHG - arxiv.org

and Phil Blunsom. 2015. Teaching machines to readand comprehend. In Advances in neural informationprocessing systems, pages 1693–1701.

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, KeruiMin, Jing Tang, and Min Sun. 2018. A unifiedmodel for extractive and abstractive summarizationusing inconsistency loss. In Proceedings of the56th Annual Meeting of the Association for Com-putational Linguistics (Volume 1: Long Papers),pages 132–141, Melbourne, Australia. Associationfor Computational Linguistics.

Diederik P Kingma and Jimmy Ba. 2015. Adam: Amethod for stochastic optimization. In Proceedingsof the International Conference on Learning Repre-sentations (ICLR).

Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan,Mirella Lapata, and Hannaneh Hajishirzi. 2019.Text Generation from Knowledge Graphs withGraph Transformers. In Proceedings of the 2019Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long and ShortPapers), pages 2284–2293, Minneapolis, Minnesota.Association for Computational Linguistics.

Wojciech Kryscinski, Romain Paulus, Caiming Xiong,and Richard Socher. 2018. Improving abstraction intext summarization. In Proceedings of the 2018 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 1808–1817.

Mike Lewis, Yinhan Liu, Naman Goyal, Mar-jan Ghazvininejad, Abdelrahman Mohamed, OmerLevy, Ves Stoyanov, and Luke Zettlemoyer. 2019.Bart: Denoising sequence-to-sequence pre-trainingfor natural language generation, translation, andcomprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin and Eduard Hovy. 2003. Auto-matic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of the 2003Conference of the North American Chapter of the As-sociation for Computational Linguistics on HumanLanguage Technology - Volume 1, pages 71–78.

Peter J. Liu, Mohammad Saleh, Etienne Pot, BenGoodrich, Ryan Sepassi, Lukasz Kaiser, and NoamShazeer. 2018. Generating wikipedia by summariz-ing long sequences. In International Conference onLearning Representations.

Yang Liu and Mirella Lapata. 2019. Text summariza-tion with pretrained encoders. In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP), pages 3721–3731, Hong Kong,China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.

Roberta: A robustly optimized BERT pretraining ap-proach. CoRR, abs/1907.11692.

Hans Peter Luhn. 1958. The automatic creation of lit-erature abstracts. IBM Journal of research and de-velopment, 2(2):159–165.

Christopher D. Manning, Mihai Surdeanu, John Bauer,Jenny Finkel, Steven J. Bethard, and David Mc-Closky. 2014. The Stanford CoreNLP natural lan-guage processing toolkit. In Association for Compu-tational Linguistics (ACL) System Demonstrations,pages 55–60.

Rada Mihalcea and Paul Tarau. 2004. Textrank: Bring-ing order into text. In Proceedings of the 2004 con-ference on empirical methods in natural languageprocessing, pages 404–411.

Shashi Narayan, Shay B Cohen, and Mirella Lapata.2018. Dont give me the details, just the summary!topic-aware convolutional neural networks for ex-treme summarization. In Proceedings of the 2018Conference on Empirical Methods in Natural Lan-guage Processing, pages 1797–1807.

Jekaterina Novikova, Ondrej Dusek, Amanda CercasCurry, and Verena Rieser. 2017. Why we need newevaluation metrics for nlg. In 2017 Conference onEmpirical Methods in Natural Language Processing,pages 2231–2242. Association for ComputationalLinguistics.

Romain Paulus, Caiming Xiong, and Richard Socher.2018. A deep reinforced model for abstractive sum-marization. In International Conference on Learn-ing Representations.

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli,and Wojciech Zaremba. 2016. Sequence level train-ing with recurrent neural networks. In 4th Inter-national Conference on Learning Representations,ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016,Conference Track Proceedings.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh,Jerret Ross, and Vaibhava Goel. 2017. Self-criticalsequence training for image captioning. In Proceed-ings of the IEEE Conference on Computer Visionand Pattern Recognition, pages 7008–7024.

Evan Sandhaus. 2008. The new york times annotatedcorpus. Linguistic Data Consortium, Philadelphia,6(12):e26752.

Thomas Scialom, Sylvain Lamprier, Benjamin Pi-wowarski, and Jacopo Staiano. 2019. Answersunite! unsupervised metrics for reinforced summa-rization models. In Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP), pages 3244–3254, Hong Kong, China. As-sociation for Computational Linguistics.

Page 12: VWRSSHG - arxiv.org

Abigail See, Peter J. Liu, and Christopher D. Manning.2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computa-tional Linguistics.

Eva Sharma, Luyang Huang, Zhe Hu, and Lu Wang.2019. An entity-driven framework for abstractivesummarization. In Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP), pages 3278–3289, Hong Kong, China. As-sociation for Computational Linguistics.

Linfeng Song, Yue Zhang, Zhiguo Wang, and DanielGildea. 2018. A graph-to-sequence model for AMR-to-text generation. In Proceedings of the 56th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 1616–1626, Melbourne, Australia. Association for Compu-tational Linguistics.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017.Abstractive document summarization with a graph-based attentional neural model. In Proceedingsof the 55th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers),pages 1171–1181, Vancouver, Canada. Associationfor Computational Linguistics.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova,Adriana Romero, Pietro Lio, and Yoshua Bengio.2018. Graph Attention Networks. InternationalConference on Learning Representations. Acceptedas poster.

Thomas Wolf, Lysandre Debut, Victor Sanh, JulienChaumond, Clement Delangue, Anthony Moi, Pier-ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow-icz, and Jamie Brew. 2019. Huggingface’s trans-formers: State-of-the-art natural language process-ing. ArXiv, abs/1910.03771.

A Appendices

A.1 Experiment DetailsStatistics of Knowledge Graphs. We show thestatistics of knowledge graphs on two datasets inTable 4. On each dataset, we construct a large graphwith abundant relations for each article. Note thaton CNN/DM we have more arguments but fewerpredicates in a document than those on NYT. Thisindicates CNN/DM has fewer coreferred entities.

Training Details. We utilize Adam (Kingma andBa, 2015) with a gradient clipping of 2.0 and abatch size of 32 for all models. During ML training,a learning rate of 0.001 is used; during RL stage, itis reduced to 0.0001 (Paulus et al., 2018).

Dataset Doc DOCGRAPH SEGGRAPH# word # Arg. # Pre. # Arg. # Pre. # Para.

NYT 795.9 131.6 87.3 6.40 3.74 23.5CNN/DM 789.9 138.1 85.2 6.30 3.57 24.2

Table 4: Statistics of NYT and CNN/DM datasets. #Arg.: number of arguments in each document or para-graph. # Pre.: number of predicates in each documentor paragraph. # Para.: number of paragraphs in eachdocument. Two datasets have comparable graph size.

We use the base version of BERT model (Devlinet al., 2019) to select candidate answers and we fine-tune the base version of RoBERTa model (Liu et al.,2019) to build our QA model. We take pretrainedmodels from Wolf et al. (2019).

A.2 Human Evaluation GuidelineIn our human evaluation, each human annotator ispresented with 100 news articles. The annotatorsare asked to evaluate four summaries (in randomorder) for each article on two aspects (informative-ness and fluency) on a scale of 1 to 5 (1 being verypoor and 5 being very good). Furthermore, forunfaithfulness, we define three types of unfaithfulerrors and ask annotators to label whether sum-maries contain any type of error. Instructions inTable 5 are given to human judges.

Here are descriptions of the aspects:

• Informativeness: Whether the summary pro-vides enough and necessary content coveragefrom the input article.

• Fluency: Whether the summary is free of ob-vious grammatically incorrect sentences (e.g.,fragments, missing components) that makethe text difficult to read.

• Faithfulness: Whether the summary accordswith the facts expressed in the source.

Page 13: VWRSSHG - arxiv.org

Article: With a Little Extra Cash.

What to do with a bonus? The right thing, of course, is to pay off debts or save it fora time when there are not any bonuses. But in Albany, any financial windfall inviteshordes of legislators hungrily seeking ways to spend it. This has already started tohappen, with lawmakers eyeballing a projected budgetary surplus of just under $1billion – not all that grand when you consider that the total state budget is in theneighborhood of $120 billion, but a healthy number nonetheless.But one essential part of the equation is different this year: a new governor guardingthe state finances. Nobody knows quite yet how Gov. Eliot Spitzer will manage aLegislature that wants to add a lot of its favorite things to his budget before they returnit for his approval. One suggestion: Mr. Spitzer should keep his fist as tightly closedas possible, especially on his new school aid formula and his Medicaid adjustments.(....)

Informativeness:

1 Not relevant to the articlee.g., “editorial on gov eliot spitzer ’s plan to spend it . of new governor guarding statefinances . and to spitzer should keep his fist as tightly closed as possible , especiallyon new school aid formula and his medicaid adjustments .”

3 Relevant, but misses the main point of the articlee.g., “editorial on new gov eliot spitzer ’s new governor guarding state finances . saysspitzer should keep his new school aid formula and his medicaid adjustments”

5 Successfully captures the main point of the articlee.g., “Editorial says New York Gov Eliot Spitzer , faced with projected $ 0 billionbudget surplus , should be tight-fisted and cautious about overspending”

Fluency:

1 Summary is full of garbage fragments and is hard to understande.g., “of new governor guarding state finances . and to spitzer should keep his fist astightly closed as possible , to”

2 Summary contains fragments, missing components but has some fluent segmentse.g., “editorial on gov eliot spitzer ’s plan to spend it . of new governor guarding statefinances . and to spitzer should keep his fist as tightly closed as possible , especiallyon new school aid formula and his medicaid adjustments.”

3 Summary contains some grammar errors but is in general fluente.g., “editorial on any financial windfall invites hordes of legislators hungrily seekingways to spend it . how gov eliot spitzer will manage legislature that wants to add lotof its favorite to his budget before they return it for his approval .”

4 Summary has relatively minor grammatical errorse.g., “article on in any financial windfall invites hordes of legislators hungrily seekingways to spend it”

5 Fluent summarye.g., “editorial says new new jersey gov eliot spitzer guarding state finances . saysspitzer should keep his new school aid formula and his medicaid adjustments”

Faithfulness:

We define three types of unfaithful errors. Each type is labeled as “0” or “1” indepen-dently. “0” means summary does not make this type of error and “1” suggests thistype of error occurs. Three types of errors are :

i Hallucination error: Fabricated content that does not occur in the original articlee.g., “correction of dec 0 about new york column on state budget”

ii Out-of-Context error: Fact occurs in the article, but fails without correct contexte.g., “Editorial says one essential part of the equation is different this year: a newgovernor guarding the tate finances.”

iii Deletion or Substitution error: Summary contains incorrectly edited, missing ele-ments; or summary incorrectly concatenates elements from different sentences.e.g., “editorial says new new jersey gov eliot spitzer guarding state finances, keepinghis new school aid formula adjustments.”

Table 5: Sample summaries with explanations on human evaluation aspect scales, and the definition of three typesof unfaithful errors.