Estimating Importance Features for Fact Mining
(With a Case Study in Biography Mining)
Sisay Fissaha Adafre, School of Computing, Dublin City University, [email protected]
Maarten de Rijke, ISLA, University of Amsterdam
Outline
Motivation
Task
Approaches
Experimental Setup
Results
Concluding Remarks
Motivation
Over 60% of Web queries are informational: “Tell me about X.”
Queries are short
TREC “Other” questions
DUC 2004 summarization: “Who is X?”
Motivation
Increasing amount of user-annotated data: Wikipedia
  The largest reference work
  Open content
  Anyone can edit its content
  Rich set of categories
Wikipedia as an “importance model” (Mishne et al. 2005)
  Nuggets from a newspaper corpus are compared with nuggets from Wikipedia.
  Higher similarity implies importance.
Applications
Uses of sentence importance estimation
Information retrieval
  Question Answering (Ahn et al., 2004)
  Novelty checking (Allan et al., 2003)
Summarization
  Graph-based methods (Erkan & Radev, 2004)
Topic Tracking (Kraaij & Spitters, 2003)
Task
Given a topic, identify sentences that are important for the topic in a general newspaper text corpus.
Example
“William H. McNeill” (born 1917, Vancouver, British Columbia) is a Canadian historian.
He is currently Professor Emeritus of History at the University of Chicago.
McNeill’s most popular work is “The Rise of the West”.
The book explored human history in terms of the effect of different old world civilizations on one another, and especially the dramatic effect of western civilization on others in the past 500 years.
It had a major impact on historical theory, especially in distinction to Oswald
Scientific aim: To compare techniques for determining important sentences
System Overview
[Figure: system pipeline with the stages Passage retrieval, Sentence extraction, Candidate sentences; Get Wikipedia categories, Select sample articles, Reference corpus; Rank sentences; Ranked sentences]
Candidate Sentence Selection
Input
  Topic: name and category of a person
  Source corpus: AQUAINT Corpus
Sentence extraction
  Source corpus is split into passages and indexed
  The topic is submitted as a query
  Top 200 passages are selected
  Passages are split into sentences
  Sentences containing the topic words are retained
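The last two extraction steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the naive regex sentence splitter, and the reading of "containing the topic words" as "containing at least one topic word" are all assumptions. The passages are assumed to arrive already ranked by a retrieval engine, which is not shown.

```python
import re


def split_sentences(passage):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.

    Abbreviations such as "N.C." would be split incorrectly; a real system
    would need a proper sentence tokenizer.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


def select_candidates(topic, ranked_passages, top_k=200):
    """Keep sentences from the top_k passages that mention any topic word."""
    topic_words = {w.lower() for w in re.findall(r"\w+", topic)}
    candidates = []
    for passage in ranked_passages[:top_k]:
        for sentence in split_sentences(passage):
            words = {w.lower() for w in re.findall(r"\w+", sentence)}
            if topic_words & words:  # at least one topic word occurs
                candidates.append(sentence)
    return candidates


# Hypothetical passage, for illustration only.
passages = [
    "Fred Durst was born in Jacksonville. The band toured widely. "
    "Durst grew up in Gastonia."
]
candidates = select_candidates("Fred Durst", passages)
```

Matching on any topic word keeps sentences that refer to the person by surname only, at the cost of admitting sentences about other people who share a name.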
Sentence Ranking
Sentences are ranked based on their similarity with reference sentences
Reference sentences
  Given a topic and its category, e.g., Brad Pitt and Actor
  The reference corpus is a set of sentences describing other entities in the same category, i.e., other actors.
Ranking sentences
Two dimensions
  Graph-based vs non-graph-based
  Using (or not) a reference corpus
Five ways
  Word overlap
  Language modelling
  Graph-based methods
    Generic
    Graph-based method with reference corpus
    Graph-based method with reference corpus plus lexical layer
Assumptions
Given an entity of some category, we consider other entities of the same category and the properties that are typically described for them.
That is, if a property is included in the descriptions of a significant portion of entities in the same category as our input entity, we assume it to be an important one.
Sentence Ranking
Similarity Measures
Word overlap
  Compute the Jaccard coefficient between candidate and reference sentences
  Sentences are ranked by their maximum scores
Language modelling
  Sentences are ranked by their likelihood w.r.t. the language model of the reference corpus
Graph-based method …
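The two non-graph measures above can be sketched as below. The slides do not give the tokenization, the language model order, or the smoothing, so the lowercase word tokenizer, the unigram model, add-one smoothing, and the per-token length normalization are all assumptions made for illustration; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Jaccard coefficient between the word sets of two sentences."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def overlap_score(candidate, references):
    """Word overlap: the maximum Jaccard score over all reference sentences."""
    return max((jaccard(candidate, r) for r in references), default=0.0)


def lm_score(candidate, references):
    """Average per-token log-likelihood under an add-one smoothed unigram
    model estimated from the whole reference corpus (assumed model)."""
    counts = Counter(t for r in references for t in tokens(r))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    toks = tokens(candidate)
    if not toks:
        return float("-inf")
    log_probs = (math.log((counts[t] + 1) / (total + vocab)) for t in toks)
    return sum(log_probs) / len(toks)


# Hypothetical sentences, for illustration only.
references = [
    "He was born in Chicago in 1950.",
    "She won an award for her first film.",
]
biographical = "He was born in Boston."
off_topic = "The weather was cold."
```

Taking the maximum over reference sentences means a candidate is rewarded for closely matching one reference sentence, whereas the language model pools all reference sentences into a single distribution.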
Sentence Ranking
Graph-based method for summarization (Erkan & Radev, 2004)
  Given a text to be summarized
  Construct a graph by linking related sentences (word overlap)
  Assign a score to each sentence using PageRank
  The sentence with the highest PageRank score is assumed to contain the salient information
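The generic graph-based procedure can be sketched in plain Python. This is an illustrative sketch in the spirit of Erkan & Radev (2004), not their exact algorithm: the Jaccard-based linking rule, the `threshold` value, the damping factor, and the fixed iteration count are assumptions.

```python
import re


def tokens(text):
    """Lowercase word set of a sentence (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def build_graph(sentences, threshold=0.1):
    """Link two sentences when their Jaccard word overlap reaches threshold."""
    n = len(sentences)
    toks = [tokens(s) for s in sentences]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = toks[i] | toks[j]
            sim = len(toks[i] & toks[j]) / len(union) if union else 0.0
            if sim >= threshold:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def pagerank(adj, damping=0.85, iterations=50):
    """Power iteration; mass of nodes without links is spread uniformly."""
    n = len(adj)
    if n == 0:
        return []
    out = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        dangling = sum(scores[j] for j in range(n) if out[j] == 0)
        scores = [
            (1 - damping) / n
            + damping * (
                dangling / n
                + sum(scores[j] * adj[j][i] / out[j] for j in range(n) if out[j] > 0)
            )
            for i in range(n)
        ]
    return scores


# Hypothetical sentences, for illustration only.
sentences = [
    "Durst was born in Jacksonville.",
    "Durst was born in Florida.",
    "Durst grew up in Gastonia.",
    "Stocks fell sharply today.",
]
scores = pagerank(build_graph(sentences))
```

In this toy input the three sentences about Durst link to one another and share the high scores, while the unrelated sentence has no links and receives the lowest score.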
Sentence Ranking
Graph-based method
[Figure: three graph configurations. Generic method without reference corpus: target sentences T1 to T7. With reference corpus: target sentences T1 to T3 and reference sentences R1 to R4. With lexical level: target sentences T1 to T3, reference sentences R1 to R4, and word nodes W1 to W3.]
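One way to read the configuration with a reference corpus is as a single graph whose nodes are both target and reference sentences, scored with a weighted PageRank, after which only the target sentences are ranked. The slides do not specify the edge weights or the scoring details, so everything below (Jaccard edge weights, damping factor, the function name `rank_with_references`) is an assumption, and the lexical layer is not modelled.

```python
import re


def tokens(text):
    """Lowercase word set of a sentence (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_with_references(targets, references, damping=0.85, iterations=50):
    """Weighted PageRank over target + reference nodes; returns the targets
    sorted by score, highest first (assumed scoring scheme)."""
    if not targets:
        return []
    nodes = [tokens(s) for s in targets + references]
    n = len(nodes)
    # Edge weight = Jaccard overlap; no self-loops.
    w = [
        [jaccard(nodes[i], nodes[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        dangling = sum(scores[j] for j in range(n) if out[j] == 0)
        scores = [
            (1 - damping) / n
            + damping * (
                dangling / n
                + sum(scores[j] * w[j][i] / out[j] for j in range(n) if out[j] > 0)
            )
            for i in range(n)
        ]
    target_scores = zip(targets, scores[: len(targets)])
    return sorted(target_scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sentences, for illustration only.
targets = [
    "Pitt was born in Oklahoma in 1963.",
    "Pitt arrived at the hotel on Monday.",
]
references = [
    "Clooney was born in Kentucky in 1961.",
    "Hanks was born in California in 1956.",
    "Depp was born in Kentucky in 1963.",
]
ranked = rank_with_references(targets, references)
```

Because a target sentence receives score from every reference sentence it resembles, this scheme combines evidence from multiple reference sentences rather than relying on the single best match.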
Research questions
Does the use of reference corpora help in improving importance estimation?
Do graph-based estimation methods outperform non-graph-based methods?
Does the additional representation of important lexical items help improve importance estimation for sentences?
Experimental Setup
Data set
TREC data set? Preliminary experiment
  Some important snippets are not included, e.g.:
  Fred Durst: Born in Jacksonville, Fla., Durst grew up in Gastonia, N.C., where his love of hip-hop music and break dancing made him an outcast.
  Eileen Marie Collins: She was born Nov. 19, 1956, in Elmira, N.Y., to Jim and Rose Collins.
New data set
  30 topics (persons)
  10 occupations
Experimental Setup
Assessment
  Take the top 20 snippets returned by the different systems
  Manually assess each snippet for important biographical information
  Two assessors
  Assessors were allowed to examine the topic in Wikipedia or using a general-purpose web search engine.
  Agreement: Kappa = 0.70
Baseline
  Rank sentences based on the retrieval scores (performed well at TREC 2003)
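For readers unfamiliar with the agreement figure, the sketch below computes Cohen's kappa for two assessors. The slides only say "Kappa", so the choice of Cohen's variant is an assumption, and the label lists are invented for illustration; they are not the assessment data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each assessor's own label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assessors used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical judgments (1 = important, 0 = not important).
assessor_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
assessor_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(assessor_1, assessor_2)
```

In the toy data the assessors agree on 8 of 10 snippets, chance agreement is 0.5, and kappa is therefore 0.6.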
Results
Methods                             WOD   WD
Non-graph-based
  Baseline                          156   243
  Word overlap                      149   302
  Language model                    152   264
Graph-based
  Generic                            99   252
  Weighted graph                    203   322
  Weighted graph with lexical rep.  209   318
600 total snippets for each run
Two scores
  WOD: without duplicates
  WD: with duplicates
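The counts can be turned into proportions of the 600 snippets per run. The sketch below assumes that each count is the number of snippets judged important in that run, which the slides imply but do not state outright; the dictionary layout and function name are illustrative.

```python
TOTAL_SNIPPETS = 600  # snippets assessed per run, as stated on the slide

# (WOD, WD) counts copied from the results table.
counts = {
    "Baseline": (156, 243),
    "Word overlap": (149, 302),
    "Language model": (152, 264),
    "Generic": (99, 252),
    "Weighted graph": (203, 322),
    "Weighted graph with lexical rep.": (209, 318),
}


def proportions(table, total=TOTAL_SNIPPETS):
    """Fraction of a run's snippets counted under each score."""
    return {m: (wod / total, wd / total) for m, (wod, wd) in table.items()}


fractions = proportions(counts)
best_wod = max(counts, key=lambda m: counts[m][0])
best_wd = max(counts, key=lambda m: counts[m][1])

for method, (wod, wd) in fractions.items():
    print(f"{method:34s} WOD {wod:.3f}  WD {wd:.3f}")
```

Under this reading, the two weighted graph runs lead on both scores, with the lexical variant ahead without duplicates and the plain weighted graph ahead with duplicates.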
Summary of importance estimation methods
Word overlap
  Based on a single sentence
  Returns several duplicates
Language modelling
  Based on the combined corpus
  Does not distinguish between sentences
  Less effective
Generic graph-based method
  Does not use the reference corpus
  Based on redundancy in the news corpus
Graph-based + reference corpus
  Combines evidence from multiple sentences
Concluding Remarks
Task: estimating importance of sentences
Main finding: combining a corpus-based approach, which captures the knowledge encoded in sentences known to be important, with a graph-based method for ranking sentences performs best
Thank you
Results: significant differences?
Significance levels: 0.05, 0.01
Method: methods it is significantly different from
  Baseline: Generic
  Word overlap: Generic
  Language model: Generic
  Generic: - -
  Weighted graph: Baseline, Generic, Word overlap, Language model
  Graph with lexical: Baseline, Generic, Word overlap, Language model