Estimating Importance Features for Fact Mining (With a Case Study in Biography Mining) Sisay Fissaha...

Estimating Importance Features for Fact Mining

(With a Case Study in Biography Mining)

Sisay Fissaha Adafre

School of Computing, Dublin City University

[email protected]

Maarten de Rijke

ISLA, University of Amsterdam

[email protected]

Outline

Motivation, Task, Approaches, Experimental Setup, Results, Concluding Remarks

Motivation

Over 60% of Web queries are informational: “Tell me about X.”

Queries are short.

TREC “Other” questions

DUC 2004 summarization: “Who is X?”

Motivation

Increasing amount of user-annotated data: Wikipedia

The largest reference work. Open content. Anyone can edit its content. Rich set of categories.

Wikipedia as an “importance model” (Mishne et al. 2005): nuggets from a newspaper corpus are compared with nuggets from Wikipedia.

Higher similarity implies importance.

Applications

Uses of sentence importance estimation

Information retrieval

Question Answering (Ahn et al., 2004)

Novelty checking (Allan et al., 2003)

Summarization

Graph-based methods (Erkan & Radev, 2004)

Topic Tracking (Kraaij & Spitters, 2003)

Task

Given a topic, identify sentences that are important for the topic in a general newspaper text corpus.

Example

“William H. McNeill” (born 1917, Vancouver, British Columbia) is a Canadian historian.

He is currently Professor Emeritus of History at the University of Chicago.

McNeill’s most popular work is “The Rise of the West”.

The book explored human history in terms of the effect of different old world civilizations on one another, and especially the dramatic effect of western civilization on others in the past 500 years.

It had a major impact on historical theory, especially in distinction to Oswald

Scientific aim: To compare techniques for determining important sentences

System Overview

[Figure: system pipeline. Boxes: Passage retrieval, Sentence extraction, Candidate sentences; Get Wikipedia categories, Select sample articles, Reference corpus; Rank sentences, Ranked sentences.]

Candidate Sentence Selection

Input

Topic: name and category of a person

Source corpus: AQUAINT Corpus

Sentence extraction

Source corpus is split into passages and indexed.

The topic is submitted as a query.

Top 200 passages are selected.

Passages are split into sentences.

Sentences containing the topic words are retained.
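The extraction steps above can be sketched roughly as follows. This is only an illustration, not the system used in the talk: the retrieval score is a stand-in (a count of topic-term occurrences), the sentence splitter is naive, and the names `select_candidates`, `split_sentences` and `top_k` are hypothetical. It also assumes that a sentence is retained when it contains at least one topic word, which the slide does not specify.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def split_sentences(passage):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

def select_candidates(topic, passages, top_k=200):
    """Rank passages for the topic, keep the top_k, and return the
    sentences that mention at least one topic word (an assumption)."""
    topic_terms = set(tokenize(topic))
    # Stand-in retrieval score: number of passage tokens that are topic terms.
    scored = [(sum(tok in topic_terms for tok in tokenize(p)), p) for p in passages]
    ranked = sorted((sp for sp in scored if sp[0] > 0),
                    key=lambda sp: sp[0], reverse=True)
    candidates = []
    for _, passage in ranked[:top_k]:
        for sentence in split_sentences(passage):
            if topic_terms & set(tokenize(sentence)):
                candidates.append(sentence)
    return candidates
```

A real setup would replace the stand-in score with the scores of an actual passage retrieval engine over the indexed corpus.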

Sentence Ranking

Sentences are ranked based on their similarity with reference sentences.

Reference sentences

Given a topic and its category, e.g. Brad Pitt and Actor, the reference corpus is a set of sentences describing other entities in the same category, i.e., other actors.

Ranking sentences

Two dimensions

Graph-based vs non-graph-based

Using (or not) a reference corpus

Five ways

Word overlap

Language modelling

Graph-based methods: generic; graph-based method with reference corpus; graph-based method with reference corpus plus lexical layer

Assumptions

Given an entity of some category, we consider other entities of the same category and the properties that are typically described for them.

That is, if a property is included in the descriptions of a significant portion of entities in the same category as our input entity, we assume it to be an important one.

Sentence Ranking

Similarity Measures

Word overlap: compute the Jaccard coefficient between candidate and reference sentences. Sentences are ranked by their maximum scores.

Language modelling: sentences are ranked by their likelihood w.r.t. the language model of the reference corpus.

Graph-based method …
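As a rough sketch of the first two measures: the word-overlap score of a candidate is its maximum Jaccard coefficient over the reference sentences, and the language-modelling score is its likelihood under a model estimated from the reference corpus. The slides do not say which language model or smoothing was used, so the add-alpha unigram model and the length normalisation below are assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercased alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def jaccard(a, b):
    """Jaccard coefficient between the word sets of two sentences."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def word_overlap_score(candidate, references):
    """Maximum Jaccard coefficient against any reference sentence."""
    return max((jaccard(candidate, r) for r in references), default=0.0)

def lm_score(candidate, references, alpha=1.0):
    """Average per-word log-likelihood under an add-alpha unigram model
    estimated from the reference sentences (smoothing is an assumption)."""
    counts = Counter(t for r in references for t in tokens(r))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    words = tokens(candidate)
    if not words:
        return float("-inf")
    logp = sum(math.log((counts[w] + alpha) / (total + alpha * vocab))
               for w in words)
    return logp / len(words)
```

Candidates would then be sorted by either score in descending order. Because the word-overlap score depends on a single best-matching reference sentence, near-identical candidates receive near-identical scores.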

Sentence Ranking

Graph-based method for summarization (Erkan & Radev, 2004)

Given a text to be summarized, construct a graph by linking related sentences (word overlap).

Assign a score to each sentence using PageRank.

The sentence with the highest PageRank score is assumed to contain the salient information.
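The generic graph-based idea can be illustrated with a small power-iteration PageRank over a sentence graph. This is a minimal sketch, not the implementation of Erkan & Radev or of the talk: the Jaccard overlap as the relatedness test, the threshold of 0.1, the damping factor of 0.85 and the name `pagerank_sentences` are all assumptions made for illustration.

```python
import re

def tokens(text):
    """Set of lowercased alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a, b):
    """Jaccard word overlap between two sentences."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def pagerank_sentences(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one PageRank score per sentence (scores sum to 1)."""
    n = len(sentences)
    if n == 0:
        return []
    # Undirected graph: link two sentences when their overlap exceeds the threshold.
    neighbours = [[j for j in range(n)
                   if j != i and overlap(sentences[i], sentences[j]) > threshold]
                  for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Score mass held by unlinked sentences is spread evenly over all nodes.
        dangling = sum(scores[i] for i in range(n) if not neighbours[i])
        new = [(1.0 - damping) / n + damping * dangling / n for _ in range(n)]
        for i in range(n):
            for j in neighbours[i]:
                new[j] += damping * scores[i] / len(neighbours[i])
        scores = new
    return scores
```

Sentences that share vocabulary with many others accumulate score, which is why this variant relies on redundancy in the source text.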

Sentence Ranking

Graph-based method

[Figure: three graph configurations. Generic method without reference corpus: target sentences T1 to T7. With reference corpus: target sentences T1 to T3 and reference sentences R1 to R4. With lexical level: target sentences T1 to T3, reference sentences R1 to R4, and words W1 to W3.]
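One possible reading of the "with reference corpus" configuration is a weighted PageRank over a combined graph of target and reference sentences, after which only the target sentences are ranked. The slides do not give the edge definition or weighting, so everything below is an assumption: edges connect target and reference sentences only, weights are Jaccard overlaps, and `rank_with_references` is a hypothetical name.

```python
import re

def tokens(text):
    """Set of lowercased alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def rank_with_references(targets, references, damping=0.85, iterations=50):
    """Rank target sentences by weighted PageRank on a target-reference graph."""
    nodes = list(targets) + list(references)
    n, t = len(nodes), len(targets)
    if t == 0:
        return []
    # Symmetric weights on target-reference pairs only (an assumption).
    weight = [[0.0] * n for _ in range(n)]
    for i in range(t):
        for j in range(t, n):
            weight[i][j] = weight[j][i] = jaccard(nodes[i], nodes[j])
    out = [sum(row) for row in weight]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Nodes without edges spread their score evenly over all nodes.
        dangling = sum(scores[i] for i in range(n) if out[i] == 0.0)
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            if out[i] > 0.0:
                for j in range(n):
                    new[j] += damping * scores[i] * weight[i][j] / out[i]
        scores = new
    order = sorted(range(t), key=lambda i: scores[i], reverse=True)
    return [(targets[i], scores[i]) for i in order]
```

Under these assumptions a target sentence scores highly when it resembles several reference sentences at once, so evidence is combined across the reference corpus rather than taken from a single best match.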

Research questions

Does the use of reference corpora help in improving importance estimation?

Do graph-based estimation methods outperform non-graph-based methods?

Does the additional representation of important lexical items help improve importance estimation for sentences?

Experimental Setup

Data set

TREC data set? Preliminary experiment: some important snippets are not included, e.g.

Fred Durst: Born in Jacksonville, Fla., Durst grew up in Gastonia, N.C., where his love of hip-hop music and break dancing made him an outcast.

Eileen Marie Collins: She was born Nov. 19, 1956, in Elmira, N.Y., to Jim and Rose Collins.

New data set: 30 topics (persons), 10 occupations

Experimental Setup

Assessment

Take the top 20 snippets returned by the different systems.

Manually assess each snippet for important biographical information.

Two assessors, who were allowed to examine the topic in Wikipedia or using a general-purpose web search engine.

Agreement: Kappa = 0.70

Baseline: rank sentences based on the retrieval scores (performed well at TREC 2003)
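For reference, agreement between two assessors is commonly measured with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The slides do not state which kappa variant was used, so this is an assumption, and the labels in the usage note are invented for illustration rather than taken from the assessment data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each assessor's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assessors used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, with hypothetical judgements `[1, 1, 0, 0, 1, 0, 1, 1, 0, 0]` and `[1, 1, 0, 0, 1, 0, 1, 0, 0, 1]` (1 = important), the observed agreement is 0.8, the chance agreement is 0.5, and kappa is about 0.6.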

Results

Methods                              WOD   WD

Non-graph-based
  Baseline                           156   243
  Word-overlap                       149   302
  Language Model                     152   264

Graph-based
  Generic                             99   252
  Weighted Graph                     203   322
  Weighted Graph with Lexical Rep.   209   318

600 total snippets for each run (30 topics × top 20 snippets)

Two scores: WOD, without duplicates; WD, with duplicates
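To compare the runs on a common scale, the counts can be turned into proportions of the 600 snippets per run. The sketch below assumes that each count is the number of snippets judged important in that run, which the slides imply but do not state; the dictionary layout and function name are hypothetical.

```python
TOTAL_SNIPPETS = 600  # 30 topics x top 20 snippets per run

# (WOD, WD) counts copied from the results table.
RESULTS = {
    "Baseline": (156, 243),
    "Word-overlap": (149, 302),
    "Language Model": (152, 264),
    "Generic": (99, 252),
    "Weighted Graph": (203, 322),
    "Weighted Graph with Lexical Rep.": (209, 318),
}

def proportions(results, total=TOTAL_SNIPPETS):
    """Map each method to its (WOD, WD) counts divided by the run size."""
    return {method: (wod / total, wd / total)
            for method, (wod, wd) in results.items()}

best_wod = max(RESULTS, key=lambda m: RESULTS[m][0])
best_wd = max(RESULTS, key=lambda m: RESULTS[m][1])
```

Under that assumption the two reference-corpus graph methods lead on both scores, with roughly a third of their snippets counted without duplicates and just over half counted with duplicates.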

Summary of importance estimation methods

Word overlap: based on a single sentence; returns several duplicates.

Language modelling: based on the combined corpus; does not distinguish between sentences; less effective.

Generic graph-based method: does not use the reference corpus; based on redundancy in the news corpus.

Graph-based + reference corpus: combines evidence from multiple sentences.

Concluding Remarks

Task: estimating importance of sentences

Main finding: combining a corpus-based approach, which captures the knowledge encoded in sentences known to be important, with a graph-based method for ranking sentences performs best.

Thank you

Results: significant differences?

Significance levels: 0.05, 0.01

Method: significantly different from

Baseline: Generic

Word overlap: Generic

Language Model: Generic

Generic: -, -

Weighted Graph: Baseline, Generic, Word overlap, Language Model

Graph with Lexical: Baseline, Generic, Word overlap, Language Model
