Post on 27-Jan-2015
Motivation - Data on the Web
10/04/23 ICWE 2013, Aalborg, Denmark
Some eye-catching opener illustrating growth and/or diversity of web data
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents
ICWE 2013: International Conference on Web Engineering
8-12 July 2013, Aalborg, Denmark
Besnik Fetahu, Bernardo Pereira Nunes, Stefan Dietze (L3S Research Center, DE)
Outline
– Introduction
– Related Work
– Focused Knowledge Extraction
• Pre-Processing & Query Expansion
• Pattern Generation
• Contextual Structure
– Evaluation
– Results
– Conclusions
Introduction
• Motivation
– Large amounts of textual Web Documents
– Efficient techniques for querying relevant information
– Extraction of chunks of text: relations, named entities etc.
– Summaries as a means of highlighting the most important chunks of text
• Issues:
– Summaries as non-structured text
– Weak relationship between user interests and the importance of specific chunks of text in a corpus
Prominent Text Summarisation Approaches
• Heuristics for relation extraction
• Extraction of information based on predefined templates
• Sentence selection based on the presence of specific terms
• Latent Semantic Analysis (LSA) for measuring importance of specific terms
• Tree Kernels encoding relevant information for event detection
• Latent Dirichlet Allocation (LDA) for topic modelling
• Populating ontologies based on extracted information from text
Research areas (figure labels): IE, IR, ML, SW
Focused Knowledge Extraction Overview
• Structured Summary Generation Components:
– Query Expansion and Reformulation
– Named Entity Definition and Co-Reference Resolution
– Pattern Generation
– Contextual Structure of Summaries
Focused Knowledge Extraction - Pipeline
Pipeline stages (figure):
– User query: Stem Cell
– Query typing and expansion: Anatomical structure, Biotechnology, Cloning, Cell biology, Developmental Biology, Stem Cell
– Corpus filtering: OR/AND of expanded query terms → filtered documents
– Annotation: NER, POS
– Patterns
– Structured summary (Entities, Actions):
Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500,000 children → lack → health insurance → enroll → 900,000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell research.
Focused Knowledge Extraction - Query Expansion
• Query (“Stem Cell”) → NER → http://dbpedia.org/page/Stem_cell
• Query Typing & Expansion
– DBpedia SPARQL Query Expansion:
• Query: “Stem Cell” is processed into:
– Typed Query:
  http://dbpedia.org/page/Stem_cell
– Expanded Query:
  http://dbpedia.org/page/Biotechnology
  http://dbpedia.org/page/Cloning
  http://dbpedia.org/page/Cell_biology
  http://dbpedia.org/page/Developmental_biology
– Conjunction/Disjunction of expanded query terms
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?o ?label WHERE {
  <http://dbpedia.org/resource/Stem_cell> ?p ?o .
  ?o rdfs:label ?label
}
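The typing and expansion step, followed by the OR/AND filtering of the corpus, can be sketched in Python. This is a minimal sketch and not the authors' implementation: the function names `build_expansion_query` and `filter_documents`, the case-insensitive substring matching, and the toy documents are assumptions made for illustration. Only the SPARQL pattern is taken from the slide.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def build_expansion_query(resource_uri):
    """Return the slide's SPARQL query for one DBpedia resource URI."""
    return (
        f"PREFIX rdfs: <{RDFS}>\n"
        "SELECT ?o ?label WHERE { "
        f"<{resource_uri}> ?p ?o . ?o rdfs:label ?label }}"
    )


def filter_documents(docs, terms, mode="OR"):
    """Keep documents matching any (OR) or all (AND) expanded query terms.

    Matching is a case-insensitive substring test (an assumption).
    """
    combine = any if mode == "OR" else all
    lowered = [t.lower() for t in terms]
    return [d for d in docs if combine(t in d.lower() for t in lowered)]


query = build_expansion_query("http://dbpedia.org/resource/Stem_cell")

# Hypothetical expanded terms and documents, for illustration only.
terms = ["Stem Cell", "Biotechnology", "Cloning"]
docs = [
    "Senate debates stem cell funding and cloning limits.",
    "Biotechnology stocks rallied on Tuesday.",
    "The Bears won their playoff game.",
]
or_hits = filter_documents(docs, terms, mode="OR")
and_hits = filter_documents(docs, terms, mode="AND")
```

The disjunction keeps the first two documents, while the conjunction keeps none, since no toy document mentions all three terms. The query string would still have to be sent to a SPARQL endpoint, which is left out here.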
Focused Knowledge Extraction - Named Entity Definitions & Co-Reference Resolution
• Entities recognised using NER & NED tools (Stanford’s NLP toolkit)
• Construct a co-occurrence matrix of proper nouns appearing consecutively
• Sample entities: “Chicago Bears”, “playoff games”
• Co-reference resolution crucial for accurate knowledge extraction
[Equation: an entity is assembled from the co-occurrence matrix M over consecutive terms, co-occurr(term_i, term_{i+1}), for i = 1, …, k]
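The co-occurrence idea for consecutive proper nouns can be illustrated with a small Python sketch. It assumes the tokens are already POS-tagged (hand-tagged here rather than produced by the Stanford tools), and the function names, the `min_count` threshold, and the merging rule are hypothetical choices for illustration, not the method from the talk.

```python
from collections import Counter


def cooccurrence_counts(tagged_tokens):
    """Count adjacent pairs of proper nouns (tags starting with NNP)."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1.startswith("NNP") and t2.startswith("NNP"):
            counts[(w1, w2)] += 1
    return counts


def merge_entities(tagged_tokens, counts, min_count=1):
    """Join runs of consecutive proper nouns whose pair count reaches min_count."""
    entities, current = [], []
    for word, tag in tagged_tokens:
        if tag.startswith("NNP"):
            # Start a new entity if this pair was not seen often enough.
            if current and counts[(current[-1], word)] < min_count:
                entities.append(" ".join(current))
                current = []
            current.append(word)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Hand-tagged toy sentence (an assumption, not NYT data).
tagged = [("The", "DT"), ("Chicago", "NNP"), ("Bears", "NNPS"),
          ("won", "VBD"), ("in", "IN"), ("Chicago", "NNP")]
counts = cooccurrence_counts(tagged)
entities = merge_entities(tagged, counts)
```

Here "Chicago Bears" is merged into one entity, while the later standalone "Chicago" stays separate.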
Focused Knowledge Extraction - Pattern Generation
• Determine topic terms (LDA) from the
underlying filtered corpus
• Annotate topic terms using POS taggers
• Pattern items:
– POS tags from topic terms
– Query terms (incl. terms after expansion)
Topic terms (LDA): police found women men dr death people drug medical officers man problems study killed heart hospital test sex patients evidence dead drugs officer…
POS-tagged topic terms: police_NN found_VBD women_NNS men_NNS dr_VBP death_NN people_NNS drug_NN medical_JJ officers_NNS man_NN problems_NNS study_NN killed_VBD heart_NN hospital_NN test_NN sex_NN patients_NNS evidence_NN dead_NN drugs_NNS officer_NN
Pattern items (POS tags): NN → VBD → NNS → VBP → NN…
Pattern items (query terms): Stem Cell → Anatomical structure → Biotechnology → Cloning → Cell Biology → Developmental Biology
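Deriving the POS-tag pattern items from the tagged topic terms can be sketched as below. The sketch assumes the `word_TAG` token format shown above and collapses consecutive duplicate tags, which reproduces the start of the tag sequence on the slide. Whether the original system collapses duplicates this way is an assumption, and `tag_sequence` is a hypothetical helper name.

```python
def tag_sequence(tagged_terms):
    """Extract POS tags from 'word_TAG' tokens, collapsing consecutive repeats."""
    tags = []
    for token in tagged_terms.split():
        # rpartition splits on the last underscore; a token without an
        # underscore is returned whole as the tag.
        _, _, tag = token.rpartition("_")
        if not tags or tags[-1] != tag:
            tags.append(tag)
    return tags


tagged = "police_NN found_VBD women_NNS men_NNS dr_VBP death_NN"
pos_items = tag_sequence(tagged)

# Pattern items combine POS tags with the (expanded) query terms.
query_terms = ["Stem Cell", "Biotechnology", "Cloning"]
pattern_items = pos_items + query_terms
```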
Focused Knowledge Extraction - Pattern Generation (I)
• Construct co-occurrence matrix of pattern items (POS tags, Query terms)
• Automatically generate emerging patterns that reflect the syntactic relevance of chunks of text
• Patterns as a sequence of co-occurring items, modelled as directed tree
graphs
• For each pattern item generate a directed tree graph, considering it as a
root node
• The pattern score conveys a pattern's importance for a given corpus and query
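The steps listed above, a co-occurrence matrix over pattern items and one directed tree per root item, can be sketched roughly as follows. This is a sketch under stated assumptions: co-occurrence is taken to mean "directly follows" within an item sequence, each item appears at most once per path, and `max_len` caps the pattern length. The names `cooccurrence_matrix` and `patterns_from` are hypothetical.

```python
from collections import defaultdict


def cooccurrence_matrix(sequences):
    """Count how often item b directly follows item a in the item sequences."""
    matrix = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            matrix[a][b] += 1
    return matrix


def patterns_from(root, matrix, max_len=4):
    """Enumerate root-to-leaf paths of the directed tree grown from root.

    An item is used at most once per path, so every path is finite.
    """
    paths = []

    def grow(path):
        children = [b for b in matrix.get(path[-1], {}) if b not in path]
        if not children or len(path) == max_len:
            paths.append(path)
            return
        for child in children:
            grow(path + [child])

    grow([root])
    return paths


# Toy item sequences (POS tags and a query term), for illustration only.
sequences = [
    ["NN", "VB", "JJ"],
    ["NN", "JJ", "VB", "RB"],
    ["Stem Cell", "NN", "VB"],
]
matrix = cooccurrence_matrix(sequences)
paths = patterns_from("NN", matrix)
```

Calling `patterns_from` once per pattern item yields one tree per root, and each root-to-leaf path is a candidate pattern.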
Generated Patterns                    Pattern Score ψscore
NN → JJ → VB → RB                     0.28571429
NN → VB → JJ → RB                     0.19949495
Stem Cell → NN → VB → RB → JJ         0.17361111
JJ → RB → VB → NN → Stem Cell         0.17347462
RB → JJ → NN → Stem Cell              0.16466599
NN → Stem Cell → RB → VB → JJ         0.16155811
RB → VB → Stem Cell → NN → JJ         0.16129665
Focused Knowledge Extraction - Pattern Generation (II)
Automatically generated patterns showing sequence of important syntactical items to appear in a sentence
Scoring mechanism of patterns as the marginal probability of co-occurring pattern items based on the filtered corpus
Prior probability of a pattern item, as the head node of the directed tree graph.
Conditional probability of two consecutive pattern items
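One plausible reading of the scoring described above is the prior probability of the head item multiplied by the conditional probabilities of each consecutive pair. The sketch below implements that reading with raw count estimates. Both the way the terms are combined and the estimators are assumptions, since the slide's formulas are not available in this text, and `pattern_score` is a hypothetical name.

```python
from collections import Counter


def pattern_score(pattern, sequences):
    """Prior of the head item times P(next | previous) for each consecutive
    pair, all estimated from raw counts (an assumed estimator)."""
    item_counts = Counter(item for seq in sequences for item in seq)
    pair_counts = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
    total = sum(item_counts.values())
    if total == 0 or not pattern:
        return 0.0
    score = item_counts[pattern[0]] / total  # prior of the head node
    for a, b in zip(pattern, pattern[1:]):
        if item_counts[a] == 0:
            return 0.0
        score *= pair_counts[(a, b)] / item_counts[a]  # conditional P(b | a)
    return score


# Toy item sequences, for illustration only.
sequences = [["NN", "VB", "JJ"], ["NN", "JJ"], ["VB", "NN"]]
seen = pattern_score(["NN", "VB", "JJ"], sequences)
unseen = pattern_score(["JJ", "NN"], sequences)
```

With these toy counts the first pattern scores 3/7 × 1/3 × 1/2 = 1/14, and a pattern containing a pair that never occurs scores zero.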
Focused Knowledge Extraction - Contextual Structure of Summaries
• Summaries generated as structured knowledge
• Decomposition of summaries into two structures:
– global (Entities, Actions) for entire corpus
– local (entity-context, action-context) for particular document
• Multiple summary perspectives based on generated context
• Enrichment with additional information from reference datasets (DBpedia)
Focused Knowledge Extraction - Contextual Structure of Summaries
Contextual Structure of Summaries with global and local structures enabling multiple summary perspectives:
“The kinds of stem cell therapies being researched for the most part do not involve the politically sensitive use of embryonic stem cells.”
– Entities (global): Stem cell, Therapies
– Actions (global): researched, involve
– Entity-context (local): Stem Cell: Embryonic, sensitive
– Action-context (local): researched: Stem cell therapies ↔ most part
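The split into a global structure (entities, actions) and a per-document local structure (entity-context, action-context) could be represented with a small data structure like the one below. The class and method names are hypothetical, and the `perspective` helper is only one possible way to read a "summary perspective" from the local contexts.

```python
from dataclasses import dataclass, field


@dataclass
class LocalContext:
    """Per-document context: terms attached to entities and actions."""
    entity_context: dict = field(default_factory=dict)
    action_context: dict = field(default_factory=dict)


@dataclass
class StructuredSummary:
    """Global structure (entities, actions) plus one local context per document."""
    entities: set = field(default_factory=set)
    actions: set = field(default_factory=set)
    local: dict = field(default_factory=dict)  # document id -> LocalContext

    def add_document(self, doc_id, entity_context, action_context):
        # Updating a set from a dict adds the dict's keys.
        self.entities.update(entity_context)
        self.actions.update(action_context)
        self.local[doc_id] = LocalContext(entity_context, action_context)

    def perspective(self, entity):
        """Collect the context terms of one entity across all documents."""
        return {doc_id: ctx.entity_context[entity]
                for doc_id, ctx in self.local.items()
                if entity in ctx.entity_context}


summary = StructuredSummary()
summary.add_document(
    "doc-1",  # hypothetical document id
    entity_context={"Stem Cell": ["Embryonic", "sensitive"]},
    action_context={"researched": ["Stem cell therapies", "most part"]},
)
view = summary.perspective("Stem Cell")
```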
Evaluation Setup
• Dataset: New York Times, year 2007
• 40,000 articles with manually generated summaries
• Summary relevance w.r.t. the generated context (query)
• Coverage of the manually generated NYT summaries
• ROUGE-n metric to measure coverage of structured vs. manually generated summaries
ROUGE-n = (matching n-grams from structured and manually generated summaries) / (total n-grams)
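An n-gram overlap measure of this kind can be computed as in the sketch below. It is not the official ROUGE toolkit: whitespace tokenisation, lowercasing, and clipped overlap counts are assumptions, and the two example strings are invented. Precision divides the matches by the n-grams of the generated summary, recall by those of the manual one.

```python
from collections import Counter


def ngrams(text, n):
    """Multiset of word n-grams using lowercase whitespace tokens."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) of n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matching = sum((cand & ref).values())  # clipped overlap counts
    precision = matching / sum(cand.values()) if cand else 0.0
    recall = matching / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented example strings, not NYT data.
p, r, f1 = rouge_n("stem cell research funding",
                   "congress debates stem cell research", n=1)
```

In this toy case three of the four candidate unigrams and three of the five reference unigrams match.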
Results
• 10 queries used for evaluation (prominent events of 2007 from Time Magazine [1])
• Human evaluation for summary relevance: 76% correctly generated
• 17 evaluators, each evaluating 20 summaries on average
[1] http://www.time.com/time/specials/2007/0,28757,1686204,00.html
Query               #Q. Terms   #Doc.   #Summ.
European Union      7           157     129
Super Bowl          13          370     325
US Congress         17          13      19
Virginia Tech       28          12      11
Stem Cell           5           105     86
Protest             2           129     103
Harry Potter        22          10      7
Global Warming      5           198     170
National Security   0           250     207
Terrorist Attacks   0           57      52
Generated structured summaries for the different queries.
Results
• ROUGE-1 evaluation results for the 10 queries
• Best-performing results for ROUGE-1: 25% precision and 32% recall
P/R/F1 measures based on the ROUGE-1 metric for the 10 queries used for evaluation
Results - Sample Generated Summaries
Query: “Stem Cell”
Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500,000 children → lack → health insurance → enrol → 900,000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell research.
Congress’s Shift in Power → revives → Medicare Debate House Democrats → try to rush → legislation → requiring → government → negotiate → lower drug prices for Medicare beneficiaries → overturning → President Bush’s restrictions on embryonic stem cell research.
The nation → welcome → ambitious agenda → being offered → today by the new Congress Democratic majority → raising → minimum wage → advancing → stem cell research → restoring → oversight of the executive branch.
New study → suggesting → useful stem cells → be derived → amniotic fluid without → destroying → embryos.
Swarns, Rachel L → announced → 9 Aug. federal government → pays → studies on stem cell colonies, lines → created before → that date, government → does not encourage → destruction of additional embryos.
Stem cell research → has not produced → a single medical treatment → is morally wrong → to create human life → to destroy → for research.
The measure → allow → scientists → receiving → federal funds → use → embryonic stem cells from surplus embryos → generated → fertility clinics, after cell lines → had been derived → by others → using → nonfederal funds.
Conclusions
• Query-based generated summaries
• Contextualised Structured Summaries
– Typing and expanding of queries using reference datasets
– Automated pattern generation
• Incorporated user interests and syntactical relevance of chunks of text
• Multiple summary perspectives
• Overall good accuracy of generated summaries (76% judged correct in the human evaluation)
• Infer new knowledge by interlinking summaries of different/same contexts
Thank you! Questions?