1
Some interesting directions in Automatic Summarization
Annie Louis
CIS 430
12/02/08
2
Today’s lecture
Multi-strategy summarization
Is one method enough?
Performance Confidence Estimation
Would be nice to have an indication of expected system performance on an input
Evaluation without human models
Can we come up with cheap and fast evaluation measures?
Beyond generic summarization
Query focused, updates, blogs, meetings, speech..
3
Relevant papers:
Lacatusu et al. LCC's GISTexter at DUC 2006: Multi-Strategy Multi-Document Summarization. In Proceedings of the Document Understanding Workshop (DUC 2006).
McKeown et al. Columbia Multi-Document Summarization: Approach and Evaluation. In Proceedings of the Document Understanding Conference (DUC 2001), 2001.
Nenkova et al. Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization. In Proceedings of ACL-08: HLT.
4
More about DUC 2002 data…
/project/cis/nlp/tools/Summarization_Data/Inputs2002
Newswire texts
There are 3 categories of inputs!
5
DUC 2002 input categories
Single event – 30 inputs
E.g. d061 – Hurricane Gilbert. Same place, roughly same time, same actions
Multiple distinct events – 15 inputs
E.g. d064 – Opening of McDonald's in Russia, Canada, South Korea... Different places, different times, different agents
Biographies – 15 inputs
E.g. d065 – Dan Quayle, Bush's nominee for vice president. One person, one event, background info, events from the past
Do you think a single method will do well for all?
6
Tf-idf summary - d061
Hurricane Gilbert Heads Toward Dominican Coast . Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a
hurricane Saturday night. Gilbert Reaches Jamaican Capital With 110 Mph Winds . Hurricane warnings were posted for the Cayman Islands, Cuba and Haiti. Hurricane Hits Jamaica With 115 mph Winds; Communications. Gilbert reached Jamaica after skirting southern Puerto Rico, Haiti and the
Dominican Republic. Gilbert was moving west-northwest at 15 mph and winds had decreased to 125
mph. What Makes Gilbert So Strong? With PM-Hurricane Gilbert, Bjt . Hurricane Gilbert Heading for Jamaica With 100 MPH Winds . Tropical Storm Gilbert
7
Tf-idf summary - d064
First McDonald's to Open in Communist Country . Police Keep Crowds From Crashing First McDonald's . McDonald's and Genex contribute $1 million each for the flagship
restaurant. A Bolshoi Mac Attack in Moscow as First McDonald's Opens . McDonald's Opens First Restaurant in China . McDonald's hopes to open a restaurant in Beijing later. The 500-seat McDonald's restaurant in a three-story building is
operated by McDonald's Restaurant Shenzhen Ltd., a wholly owned subsidiary of McDonald's Hong Kong.
McDonald's Hong Kong is a 50-50 joint venture with McDonald's in the United States.
McDonald's officials say it is not a question that
8
Tf-idf summary - d065
Tucker was fascinated by the idea, Quayle said. But Dan Quayle's got experience, too. Quayle's Triumph Quickly Tarnished . Quayle's Biography Inflates State Job; Quayle Concedes Error . Her statement was released by the Quayle campaign. But he would go no further in describing what assignments he
would give Quayle. ``I will be a very close adviser to the president,'' Quayle said. ``You're never going to see Dan Quayle telling tales out of It was everything Quayle had hoped for. Quayle had said very little and he had said it very well. There are windows into the workings of the
9
Multi-strategy summarization
Multiple summarization modules within a single system
Better than a single method
How to employ a multi-strategy system?
Use all methods, produce multiple summaries, choose the best
Use a router and summarize with only one specific method
10
Produce multiple summaries and choose – LCC GISTexter
Task – query focused summarization
The query is decomposed by 3 methods
Sent to a QA system and a multi-document summarizer
6 different summaries
Select the best summary
Textual entailment + pyramid scoring
11
Route to a specific module – Columbia's multi-document summarizer
Features classify an input as:
Single event
Biography
Loosely connected documents
The result of the classification routes the input to one of 3 different summarizers
12
Features - Single Event
To identify:
Time span between publication dates < 80 days
More than 50% of documents published on the same day
To summarize:
Exploit redundancy, cluster similar sentences into themes
Rank themes based on size, similarity, and the ranking of contained sentences by lexical chains
Select phrases from each theme
Generate sentences
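The identification features above can be sketched as a small routing check. The thresholds (80-day span, 50% same-day publication) are the ones on the slide; combining them with a logical OR is an illustrative assumption:

```python
from datetime import date

def looks_like_single_event(pub_dates, max_span_days=80, same_day_frac=0.5):
    """Heuristic single-event check over publication dates.

    The thresholds (< 80-day span, > 50% same-day publication) come from
    the slide; treating either condition as sufficient (OR) is an
    illustrative assumption, not the Columbia system's exact rule."""
    span = (max(pub_dates) - min(pub_dates)).days
    if span < max_span_days:
        return True
    # Fraction of documents sharing the most common publication date
    most_common = max(pub_dates.count(d) for d in set(pub_dates))
    return most_common / len(pub_dates) > same_day_frac
```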
13
Features - Biographies
To identify:
Frequency of the most frequent capitalized word > X (compensating for named entities)
Frequency of personal pronouns > Y
To summarize:
Is the target individual mentioned in the sentence?
Is another individual found in the sentence?
Position of the most prominent capitalized word in the sentence
14
Features – Weakly related documents
To identify:
Neither single event nor biographical
To summarize:
Words likely to be used in first paragraphs, i.e. important words – learnt from corpus analysis
Verb specificity
Semantic themes – WordNet concepts
Positional and length features
More weight to recent articles
Downweight sentences with pronouns
15
Characterizing/classifying inputs
Important if you want to route to a specialized summarizer
Classification can be made along several lines:
Theme of input – Columbia's summarizer
Scientific/news articles
Long/short documents
News articles about events/editorials
Difficult/easy??
16
Input difficulty and Performance Confidence Estimation
Some inputs are more difficult than others
– Most summarizers produce poor summaries for these inputs
17
Input to summarizer
Some inputs are easier than others!
Average system scores obtained on different inputs for 100-word summaries:
mean 0.55
min 0.07
max 1.65
Data: DUC 2001; score range 0-4
18
Input difficulty & content coverage scores
Content coverage score: extent of coverage of important content
Poor content selection -> low score
If most summaries for an input get low scores, most systems could not identify important content: a "difficult input"
19
Multi-document inputs were from 5 categories, each a set of documents describing...
Single event: the Exxon Valdez oil spill
Subject: mad cow disease
Biographical: Elizabeth Taylor
Multiple distinct events: different occasions of police misconduct
Opinion: views of the senate, public, congress, lawyers etc. on the decision by the senate to count illegal aliens in the 1990 census
Single task – generic summarization
Did system performance vary with DUC 2001 input categories?
Cohesive / “On topic” Inputs
Non Cohesive/ “Multiple facets” Inputs
20
Input type influenced the scores obtained
Biographical, single event, and subject inputs
are easier to summarize than
multiple distinct events and opinions
21
Cohesive inputs are easier to summarize
Cohesive: biographical, single event, subject
Non-cohesive: multiple distinct events, opinions
Scores for cohesive inputs are significantly* higher than those for non-cohesive inputs at 100, 200 and 400 words
* One-sided t-tests, 95% significance level
22
Inputs can be easy or difficult =>
Better summarizers ~ different methods to summarize different inputs: multi-strategy
Enhancing user experience ~ a system can flag summaries likely to be poor in content: low system confidence on difficult inputs
23
First step..
What characterizes difficult inputs? Find useful features
Can we identify difficult inputs with high accuracy? Classification task – difficult vs. easy
24
Features – Simple length-based
Smaller inputs ~ less loss of information ~ better summaries
Number of sentences ~ information to be captured in the summary
Vocabulary size ~ number of unique words
25
Features – Word distributions in the input
% of words used only once ~ lexical repetition
less repetition of content ~ difficult inputs
Type-token ratio ~ lexical variation in the input
fewer types ~ easy inputs
Entropy of the input
~ descriptive words ~ high probabilities ~ low entropy ~ easy
H(X) = - Σ_{i=1..n} p(w_i) log2 p(w_i)
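The entropy feature can be sketched in a few lines over the input's unigram distribution:

```python
import math
from collections import Counter

def input_entropy(words):
    """H(X) = -sum_i p(w_i) * log2 p(w_i) over the input's unigrams.

    A few dominant, descriptive words give high probabilities and thus
    low entropy, which the slides associate with easy inputs."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, an input of two equally frequent words has an entropy of exactly 1 bit.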
26
Features – Document similarity and relatedness
Documents with overlapping content ~ easy input
Pair-wise cosine overlap (average, min, max) ~ similarity of the documents
High cosine overlaps ~ overlapping content ~ easy to summarize
cos(v1, v2) = (v1 . v2) / (|v1| |v2|)
v1, v2 – tf-idf weights of the content words of the 2 documents
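The pair-wise overlap is plain cosine similarity over sparse word vectors; this sketch takes word -> weight dicts (the slides use tf-idf weights of content words, but any weighting can be plugged in):

```python
import math

def cosine(v1, v2):
    """cos(v1, v2) = (v1 . v2) / (|v1| |v2|) for sparse word -> weight dicts.

    The slides weight content words by tf-idf; the weighting scheme is
    supplied through the dict values."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Averaging cosine() over all document pairs in an input gives the average variant of the feature; min and max follow the same way.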
27
Features – Document similarity and relatedness
Tightly bound by topic ~ easy input
KL divergence ~ distance from a large collection of random documents
Difference between 2 language models: input & random collection
Greater divergence ~ input is unlike random documents ~ tightly bound input
KLdivergence = Σ_w p_inp(w) log2 ( p_inp(w) / p_coll(w) )
coll – all documents from all tasks of DUC 2001 to 2006
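A sketch of the KL feature over two word -> probability dicts; the collection model is assumed to be smoothed so every input word has nonzero probability:

```python
import math

def kl_divergence(p_inp, p_coll):
    """KL(input || collection) = sum_w p_inp(w) * log2(p_inp(w) / p_coll(w)).

    Both arguments are word -> probability dicts; p_coll is assumed to be
    smoothed so every input word has nonzero collection probability.
    Larger values mean the input is unlike the random document pool,
    i.e. tightly bound by its topic."""
    return sum(p * math.log2(p / p_coll[w]) for w, p in p_inp.items() if p > 0)
```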
28
Features – Log likelihood ratio based
More topic terms, similar topic terms ~ topic-oriented, easy input
Number of topic signature terms
Percentage of topic signatures in the vocabulary ~ controls for the length of the input
Pair-wise topic signature overlap (average, min, max) ~ similarity between the topic vectors of different documents
~ cosine overlap with reduced & specific vectors
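The signature overlap is cosine similarity restricted to the topic signature terms. In this sketch the signature set is assumed to be precomputed elsewhere (e.g. with a log-likelihood ratio test):

```python
import math

def signature_overlap(doc1, doc2, signatures):
    """Cosine overlap computed only over topic signature terms.

    doc1 and doc2 are word -> count dicts; `signatures` is a precomputed
    set of topic signature terms (obtained separately, e.g. via the
    log-likelihood ratio test). Restricting the vectors to signature
    terms gives the "reduced & specific" vectors of the slide."""
    v1 = {t: doc1.get(t, 0.0) for t in signatures}
    v2 = {t: doc2.get(t, 0.0) for t in signatures}
    dot = sum(v1[t] * v2[t] for t in signatures)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```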
29
What makes some inputs easy?
Easy inputs have:
smaller vocabulary
smaller entropy
greater divergence from a random collection
higher % of topic signatures in the vocabulary
higher avg cosine and topic overlap
30
Input difficulty hypothesis for systems
Indicator of an input's difficulty: average system coverage score. Difficult if most systems select poor content
Defining the difficulty of inputs: 2 classes, above/below the "mean average system score"
> mean score – easy
< mean score – difficult
Equal classes
31
Classification Results
Baseline performance: 50%
Test set: DUC 2002-04
10-fold cross-validation on 192 observations
Precision and recall of difficult inputs:
Accuracy 69.27%
Precision 0.696
Recall 0.674
32
Summary Evaluation without Human Models
Current evaluation measures – recap:
Content coverage
Pyramid
Responsiveness
ROUGE
* My work with Ani
33
Need for cheap, fast measures
All current evaluations require human effortHuman summaries (content overlap, pyramid, rouge)Manual marking of summaries (responsiveness)
Human summaries are biasedseveral summaries for the same input are needed to
remove bias (Pyramid, ROUGE)
Can we come up with cheaper evaluation techniques that will produce the same rankings for systems as human evaluations ?
34
Compare with input – no human models
Estimate the closeness of a summary to the input
The closer a summary is to the input, the better its content should be
How do we verify this?
Design some features that reflect how close a summary is to the input
Rank summaries based on the value of each feature
Compare the obtained rankings to rankings given by humans
Similar rankings (high correlation) – you have succeeded
35
What features should we use?
We want to know how well a summary reflects the input's content.
Guesses?
36
Features - Divergence between input and summary
Smaller divergence ~ better summary
KL divergence input – summary
KL divergence summary – input
Jensen-Shannon divergence:
JS(Inp || Summ) = H( (Inp + Summ) / 2 ) - [ H(Inp) + H(Summ) ] / 2
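A sketch of Jensen-Shannon divergence, JS(Inp || Summ) = H((Inp + Summ)/2) - (H(Inp) + H(Summ))/2, over word -> probability dicts:

```python
import math

def jensen_shannon(p, q):
    """JS(P || Q) = H((P + Q)/2) - (H(P) + H(Q))/2, entropies in bits.

    p and q are word -> probability dicts; words missing from one
    distribution are treated as probability 0. Smaller divergence
    between input and summary ~ better summary."""
    def entropy(dist):
        return -sum(v * math.log2(v) for v in dist.values() if v > 0)
    words = set(p) | set(q)
    mid = {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in words}
    return entropy(mid) - (entropy(p) + entropy(q)) / 2
```

Unlike KL, JS is symmetric and bounded between 0 (identical distributions) and 1 bit (disjoint vocabularies).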
37
Features – Use of topic words from the input
More topic words ~ better summary
% of summary composed of topic words
% of input's topic words carried over to the summary
38
Features – Similarity between input and summary
More similar to the input ~ better summary
Cosine similarity input – summary words
Cosine similarity input’s topic signatures – summary words
39
Features - Summary Probability
Higher likelihood of summary given input ~ better summary
Unigram summary probability:
p_Inp(w1)^n1 * p_Inp(w2)^n2 * ... * p_Inp(wr)^nr
Multinomial summary probability:
[ N! / (n1! n2! ... nr!) ] * p_Inp(w1)^n1 * p_Inp(w2)^n2 * ... * p_Inp(wr)^nr
where
r – size of the summary vocabulary
ni – count of word wi in the summary
n1 + n2 + ... + nr = N, the summary size
p_Inp(wi) – probability of wi under the input's language model
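A sketch of both probabilities, computed in log space to avoid underflow for realistic summary lengths; p_inp is a word -> probability dict for the input's unigram model:

```python
import math
from collections import Counter

def log_unigram_prob(summary_words, p_inp):
    """log2 of p_inp(w1)^n1 * ... * p_inp(wr)^nr, where ni is the count
    of word wi in the summary and p_inp the input's unigram model."""
    counts = Counter(summary_words)
    return sum(n * math.log2(p_inp[w]) for w, n in counts.items())

def log_multinomial_prob(summary_words, p_inp):
    """Adds the multinomial coefficient N! / (n1! ... nr!), counting all
    orderings of the summary's bag of words (N = summary size)."""
    counts = Counter(summary_words)
    log_coef = math.log2(math.factorial(len(summary_words))) - sum(
        math.log2(math.factorial(n)) for n in counts.values())
    return log_coef + log_unigram_prob(summary_words, p_inp)
```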
40
Analysis of features
The value of the feature is used as the score for the summary
Average the feature values for a particular system over all inputs
Compare to the average human score
Spearman (rank) correlation
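The comparison step can be sketched with the no-ties Spearman formula, rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), applied to one score per system:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no-ties form):
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of xs[i] and ys[i]."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Here xs would be the per-system feature averages and ys the per-system human scores; values near +1 or -1 mean the feature ranks systems the way the human evaluation does.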
41
Results
Feature  Pyramid  Responsiveness
JSD  -0.8803  -0.7364
% input's topic words in summary  -0.8741  -0.8249
KL div summary-input  -0.7629  -0.6941
Cosine overlap  0.7117  0.6469
% of summary that is topic words  0.7115  0.6015
KL div input-summary  -0.6875  -0.5850
Unigram summary prob.  -0.1879  -0.1006
Multinomial summary prob.  0.2224  0.2353
TAC 2008 Query focused summarization
48 inputs, 57 systems
42
Evaluation without human models
Comparison with the input correlates well with human judgements
Cheap, fast, unbiased
No human effort needed
43
Other summarization tasks of interest
Update summaries
The user has read a set of documents A
Produce a summary of updates from a set B of documents published later in time
Query focused
A topic statement is given to focus content selection
44
Other summarization tasks of interest
Blog/opinion summarization
Mine opinions, good/bad product reviews etc.
Meeting/speech summarization
How would you summarize a brainstorming session?
45
What you have learnt today..
How simple features you already know can be put to use for interesting applications
Beyond a simple sentence-extractor engine, customizing for inputs/user/task-setting is important
There are a lot of interesting tasks in summarization and language processing