Predicting sentence specificity, with applications to news summarization
Ani Nenkova, joint work with Annie Louis
University of Pennsylvania
Motivation
A well-written text is a mix of general statements and sentences providing details.
In information retrieval: find relevant and well-written documents.
Writing support: visualize general and specific areas of a text.
Supervised sentence-level classifier for general vs. specific sentences
Training data: existing annotations of discourse relations from the PDTB
Features: lexical, language model, syntax, etc.
Testing data: annotators directly judged additional sentences
Application to the analysis of summarization output: automatic summaries are too specific, and their quality suffers for it
Training data
Penn Discourse Treebank (PDTB)
Largest annotated corpus of explicit and implicit discourse relations
1 million words of Wall Street Journal
Arguments – spans linked by a relation (Arg1, Arg2)
Sense – semantics of the relation (3 level hierarchy)
Explicit: I love ice-cream but I hate chocolates. (signalled by the discourse connective "but")
Implicit: I came late. I missed the train. (adjacent sentences in the same paragraph)
Distribution of relations between adjacent sentences
(Adjacent sentences linked by an entity. Not considered a true discourse relation.)
Training data from PDTB: Expansion relations

Expansion
  Conjunction [also, further]
  Restatement [specifically, overall]
    Specification
    Equivalence
    Generalization
  Instantiation [for example]
  List [and]
  Alternative [or, instead]
    Conjunctive
    Disjunctive
    Chosen alternative
  Exception [except]
Instantiation example
The 40-year-old Mr. Murakami is a publishing sensation in Japan.
A more recent novel, “Norwegian Wood”, has sold more than forty million copies since Kodansha published it in 1987.
Examples of general/specific sentences
Despite recent declines in yields, investors continue to pour cash into money funds.
Assets of the 400 taxable funds grew by $1.5 billion during the latest week, to $352 billion. [Instantiation]
By most measures, the nation’s industrial sector is now growing very slowly—if at all.
Factory payrolls fell in September. [Specification]
Experimental setup: two classifiers
Instantiation-based: Arg1 = general, Arg2 = specific; 1,403 examples
Specification-based (Restatement:Specification): Arg1 = general, Arg2 = specific; 2,370 examples
Implicit relations only; 50% baseline accuracy; 10-fold cross-validation; logistic regression
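As an illustration of this setup, the following is a minimal sketch of a logistic regression classifier evaluated with 10-fold cross-validation; scikit-learn and the random placeholder features are assumptions, not the original implementation.

    # Sketch of the setup above: logistic regression, 10-fold cross-validation,
    # one example per argument span (Arg1 = general, Arg2 = specific).
    # scikit-learn and the placeholder features are assumptions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1403, 20))          # stand-in for the real feature vectors
    y = rng.integers(0, 2, size=1403)        # 0 = general (Arg1), 1 = specific (Arg2)

    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"10-fold CV accuracy: {scores.mean():.3f} (chance = 0.50 for balanced classes)")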
Features
Developed from a small development set: 10 pairs of Specification, 10 pairs of Instantiation

Features for general vs. specific
Sentence length: no. of tokens, no. of nouns (expected general sentences to be shorter)
Polarity: no. of positive/negative/polarity words, also normalized by length; General Inquirer and MPQA subjectivity lexicons (in the dev set, sentences with strong opinion are general)
Language models: unigram/bigram/trigram probability & perplexity, trained on one year of New York Times news (in the dev set, general sentences contained unexpected, catchy phrases)
Features for general vs. specific (continued)
Specificity: min/max/avg IDF; WordNet hypernym distance to root for nouns and verbs (min/max/avg)
Syntax: no. of adjectives, adverbs, ADJPs, ADVPs, verb phrases; avg. VP length
Entities: numbers, proper names, $ sign, plural nouns
Words: count of each word in the sentence
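A rough sketch of a few of the listed features (token and noun counts, IDF statistics, WordNet hypernym depth), assuming NLTK and an externally computed IDF table; the exact feature definitions in the original work may differ.

    # Illustrative extraction of a few of the features listed above.
    # Assumes NLTK (with punkt, the POS tagger model and wordnet data installed)
    # and an idf dict mapping word -> IDF computed from a background corpus.
    import nltk
    from nltk.corpus import wordnet as wn

    def sentence_features(sentence, idf):
        tokens = nltk.word_tokenize(sentence)
        tags = nltk.pos_tag(tokens)
        feats = {"num_tokens": len(tokens),
                 "num_nouns": sum(1 for _, t in tags if t.startswith("NN"))}

        # min/max/avg IDF over tokens present in the IDF table
        idfs = [idf[w.lower()] for w in tokens if w.lower() in idf]
        if idfs:
            feats.update(min_idf=min(idfs), max_idf=max(idfs),
                         avg_idf=sum(idfs) / len(idfs))

        # WordNet hypernym distance to root for nouns and verbs (first synset)
        depths = []
        for w, t in tags:
            pos = wn.NOUN if t.startswith("NN") else wn.VERB if t.startswith("VB") else None
            if pos:
                synsets = wn.synsets(w, pos=pos)
                if synsets:
                    depths.append(synsets[0].min_depth())
        if depths:
            feats.update(min_depth=min(depths), max_depth=max(depths),
                         avg_depth=sum(depths) / len(depths))
        return feats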
Accuracy of general/specific classifier using Instantiations
[Bar chart, accuracy 50-80%, one bar per feature set: verbs, sent. len., polarity, syntax, specificity, lang. md., entities, words, all, all-words]
Best: 76% accuracy
Accuracy of general/specific classifier using Specifications
[Bar chart, accuracy 50-65%, one bar per feature set: polarity, verbs, lang. md., entities, sent. len., specificity, syntax, words, all, all-words]
Best: 60% accuracy
The Instantiation-based classifier gave better performance.
Best individual feature set: words (74.8%); non-lexical features are equally good: 74.1%
No improvement from combining all features: 75.8%
Feature analysis: words with highest weight [Instantiation-based]
General: number, but, also, however, officials, some, what, lot, prices, business, were…
Specific: one, a, to, co, I, called, we, could, get…
General sentences are characterized by plural nouns, the dollar sign, lower probability, more polarity words, and more adjectives and adverbs.
Specific sentences are characterized by numbers and names.
More testing data
Direct judgments of WSJ and AP sentences on Amazon Mechanical Turk
~600 sentences, 5 judgments per sentence
Agree    WSJ Total  WSJ General  WSJ Specific  AP Total  AP General  AP Specific
5            96          51           45           108        33          75
4           102          57           45            91        35          56
3            95          52           43            88        49          39
Total       294         160          133           292       117         170
In WSJ, more sentences are general (55%). In AP, more sentences are specific (60%).
Why the difference between Instantiation and Specification? Some of the annotated sentences were part of our initial training data, so we can check them against the Arg1/Arg2 labels.
Instantiation (32 examples)
        General  Specific
Arg1      29        3
Arg2       6       26

Specification (16 examples)
        General  Specific
Arg1      10        6
Arg2       8        8
Instantiation has more detectable general/specific properties associated with Arg1 and Arg2.
Accuracy of classifier on new data

               Instantiation-based classifier     Specification-based classifier
Examples      All features  Non-lexical  Words    All features  Non-lexical  Words
5 agree           90.6         96.8       84.3        69.4          94.4      78.7
4+5 agree         80.8         88.8       77.7        65.8          89.9      74.8
All               73.7         76.7       71.6        59.2          81.1      67.5
Non-lexical features work better on this data. Performance is almost the same as in cross-validation.
The classifier is more accurate on examples where people agree; classifier confidence correlates with annotator agreement.
Application of our classifier to full articles
Distribution of general/specific sentences in news documents
Can the classifier detect differences in general/specific summaries written by people?
Do summaries have more general/specific content compared to the input? How does it impact summary quality?
Compare different types of summaries:
Human abstracts: written from scratch
Human extracts: whole sentences selected from the inputs
System summaries: all are extracts
Example general and specific predictions
Seismologists said the volcano had plenty of built-up magma and even more severe eruptions could come later. [general]
The volcano's activity -- measured by seismometers detecting slight earthquakes in its molten rock plumbing system -- is increasing in a way that suggests a large eruption is imminent, Lipman said. [specific]
Example predictions
The novel, a story of a Scottish low-life narrated largely in Glaswegian dialect, is unlikely to prove a popular choice with booksellers who have damned all six books shortlisted for the prize as boring, elitist and – worst of all – unsaleable. [specific]
…The Booker prize has, in its 26-year history, always provoked controversy. [general]
Computing specificity for a text
Sentences in a summary vary in length, so we compute the score at the word level: the average specificity of the words in the text.
[Diagram: each sentence S1, S2, S3 gets the classifier's confidence of belonging to the specific class (0.23, 0.81, 0.68 in the illustration); every token inherits its sentence's score, and the specificity score of the text is the average over all tokens.]
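A small sketch of this scoring, assuming per-sentence confidences are produced by a classifier like the one above; the function and toy numbers are illustrative only.

    # Text-level specificity: every token inherits its sentence's confidence of
    # being specific, and the text score is the average over all tokens.
    def text_specificity(tokenized_sentences, sentence_confidences):
        token_scores = []
        for tokens, conf in zip(tokenized_sentences, sentence_confidences):
            token_scores.extend([conf] * len(tokens))
        return sum(token_scores) / len(token_scores)

    # Toy example using the confidences from the diagram (0.23, 0.81, 0.68).
    sents = [["S1", "has", "four", "tokens"], ["S2", "tokens"], ["S3", "has", "tokens"]]
    confs = [0.23, 0.81, 0.68]
    print(round(text_specificity(sents, confs), 3))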
50 specific and general human summaries

Text         General category   Specific category
Summaries         0.55               0.63
Inputs            0.63               0.65
No significant differences in specificity of the input
Significant differences in specificity of summaries in the two categories
Our classifier is able to detect the differences
Data: DUC 2002
Generic multi-document summarization task
59 input sets, 5 to 15 news documents each
3 types of summaries, 200 words each; manually assigned content and linguistic quality scores
1. Human abstracts (2 assessors × 59 inputs)
2. Human extracts (2 assessors × 59 inputs)
3. System extracts (9 systems × 59 inputs)
Specificity analysis of summaries
1. More general content is preferred in abstracts
2. Simply the process of extraction makes summaries more specific
3. System summaries are overly specific
[Histogram of specificity scores; avg. specificity: Inputs 0.65, H. abstracts 0.62, H. extracts 0.72, S. extracts 0.74]
Human summaries are more general
Is this aspect related to summary quality?
Analysis of ‘system summaries’: specificity and quality
1. Content quality: importance of the content included in the summary
2. Linguistic quality: how well-written the summary is perceived to be
3. Quality of general/specific summaries: when a summary is intended to be general or specific
Relationship to content selection scores
Coverage score: closeness to the human summary (clause-level comparison)
For system summaries, the correlation between coverage score and average specificity is -0.16*, p-value = 0.0006.
Less specific ~ better content, but the correlation is not very high.
Specificity is related to the realization of content, which is different from the importance of the content.
Content quality = content importance + appropriate specificity level
Content importance: ROUGE scores (n-gram overlap of the system summary and human summary), the standard evaluation of automatic summaries
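To make the n-gram overlap concrete, here is a simplified single-reference ROUGE-2 recall computation; the official ROUGE toolkit additionally handles stemming, multiple references and jackknifing, which this sketch omits.

    # Simplified ROUGE-2 recall: fraction of the reference's bigrams that also
    # appear in the system summary (with clipped counts). Illustration only.
    from collections import Counter

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    def rouge2_recall(system_tokens, reference_tokens):
        sys_counts = Counter(bigrams(system_tokens))
        ref_counts = Counter(bigrams(reference_tokens))
        if not ref_counts:
            return 0.0
        overlap = sum(min(c, sys_counts[bg]) for bg, c in ref_counts.items())
        return overlap / sum(ref_counts.values())

    print(rouge2_recall("the cat sat on the mat".split(),
                        "the cat lay on the mat".split()))   # 0.6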
Specificity as one of the predictors
Coverage score ~ ROUGE-2 (bigrams) + specificity
Linear regression
Weights for predictors in the regression model
Predictor      Mean β    Significance (hypothesis β = 0)
(Intercept)     0.212     2.3e-11
ROUGE-2         1.299     < 2.0e-16
Specificity    -0.166     3.1e-05
Is the combination a better predictor than ROUGE alone?
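A sketch of fitting this regression, assuming statsmodels and synthetic placeholder per-summary values (the slides do not specify the software used):

    # Illustrative fit of: coverage ~ ROUGE-2 + specificity (ordinary least squares).
    # statsmodels and the synthetic data are assumptions, not the original analysis.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"rouge2": rng.uniform(0.0, 0.3, size=200),
                       "specificity": rng.uniform(0.5, 0.9, size=200)})
    df["coverage"] = (0.2 + 1.3 * df["rouge2"] - 0.17 * df["specificity"]
                      + rng.normal(scale=0.05, size=200))

    X = sm.add_constant(df[["rouge2", "specificity"]])
    model = sm.OLS(df["coverage"], X).fit()
    print(model.params)    # beta for intercept, ROUGE-2, specificity
    print(model.pvalues)   # significance of each predictor (hypothesis beta = 0)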
2. Specificity and linguistic quality
Used different data: TAC 2009 (DUC 2002 only reported the number of errors, and only as a range: 1-5 errors)
TAC 2009 linguistic quality score: manually judged on a scale of 1-10; combines different aspects: coherence, referential clarity, grammaticality, redundancy
What is the avg. specificity in different score categories?

Ling. score     No. summaries   Average specificity
Poor (1, 2)          202              0.71
Mediocre (5)         400              0.72
Best (9, 10)          79              0.77

More general ~ lower score! General content is useful, but it needs the proper context.
If a summary starts as follows: “We are quite a ways from that, actually.” As ice and snow at the poles melt, …
Specificity = low; linguistic quality = 1
Data for analysing the generalization operation
Aligned pairs of abstract and source sentences conveying the same content; the traditional data used for compression experiments
Ziff-Davis tree alignment corpus: 15,964 sentence pairs; any number of deletions, up to 7 substitutions
Only 25% of abstract sentences are mapped, but useful for observing the trends
[Galley & McKeown (2007)]
Generalization operation in human abstracts

Transition   No. pairs   % pairs
SS             6371        39.9
SG             5679        35.6
GG             3562        22.3
GS              352         2.2

One-third of all transformations are specific-to-general.
Human abstracts involve a lot of generalization.
How do specific sentences get converted to general?

Transition   Orig. length (words)   New/orig. length (%)   Avg. deletions (words)
SG                  33.5                    40.8                   21.4
SS                  33.4                    56.6                   16.3
GG                  21.5                    60.8                    9.3
GS                  22.7                    66.0                    8.4
Choose long sentences and compress heavily!
A measure of generality would be useful to guide compression. Currently only importance and grammaticality are used.
Use of general sentences in human extracts
Details of Maxwell's death were sketchy.
Folksy was an understatement.
“Long live democracy!”
Instead it sank like the Bismarck.

Example use of a general sentence in a summary:
…With Tower's qualifications for the job, the nominations should have sailed through with flying colors. [specific]
Instead it sank like the Bismarck. [general]
…Future: can we learn to generate and select general sentences to include in automatic summaries?
Conclusions
Built a classifier for general and specific sentences using existing annotations, but tested it on new data and in a task-based evaluation.
The confidence of the classifier is highly correlated with human agreement.
Analyzed human and machine summaries: machine summaries are too specific, but adding general sentences is difficult because the context has to be right.
Further details in:
Annie Louis and Ani Nenkova. Automatic identification of general and specific sentences by leveraging discourse annotations. Proceedings of IJCNLP, 2011 (to appear).
Annie Louis and Ani Nenkova. Text specificity and impact on quality of news summaries. Proceedings of the ACL-HLT Workshop on Monolingual Text-To-Text Generation, 2011.
Annie Louis and Ani Nenkova. Creating Local Coherence: An Empirical Assessment. Proceedings of NAACL-HLT, 2010.
Two types of local coherence: entity & rhetorical
Local coherence: adjacent sentences in a text flow from one to another
Entity coherence (same topic): John was hungry. He went to a restaurant.
But only 42% of sentence pairs are entity-linked [previous corpus studies].
Will core discourse relations connect the non-entity-sharing sentence pairs? A popular hypothesis in prior work.
Investigations into text quality
The mix of discourse relations in a text is highly predictive of the perceived quality of the text.
Both implicit and explicit relations are needed to predict text quality.
Predicting the sense of implicit discourse relations is a very difficult task; most are predicted to be Expansion.
How is local coherence created?
Joint analysis combining PDTB and OntoNotes annotations: 590 articles, with noun phrase coreference from OntoNotes
40 to 50% of sentence pairs do not share entities, across articles of different lengths
Expansions cover most of the non-entity-sharing instances.
Expansions have the lowest rate of coreference.
Rate of coreference in 2nd-level elaboration relations
Example Instantiation and List relations

Instantiation:
The economy is showing signs of weakness, particularly among manufacturers.
Exports, which played a key role in fueling growth over the last two years, seem to have stalled.

List:
Many of Nasdaq's biggest technology stocks were in the forefront of the rally.
- Microsoft added 2 1/8 to 81 3/4 and Oracle Systems rose 1 1/2 to 23 1/4.
- Intel was up 1 3/8 to 33 3/4.
Overall distribution of sentence pairs among the two coherence devices
30% of sentence pairs have no coreference and stand in a weak discourse relation (Expansion/EntRel).
We must explore elaborations more closely to identify how they create coherence.