A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Post on 10-May-2015

Most information extraction approaches available today have focused either on the extraction of simple relations or on scenarios where data extracted from texts should be normalized into a database schema or ontology. Some relevant information present in natural language texts, however, can be irregular, highly contextualized, with complex semantic dependency relations, poorly structured, and intrinsically ambiguous. These characteristics should also be supported by an information extraction approach. To cope with this scenario, this work introduces a semantic best-effort information extraction approach, which targets an information extraction scenario where text information is extracted under a pay-as-you-go data quality perspective, trading high accuracy, schema consistency and terminological normalization for domain independence, context capture, wider extraction scope and maximization of the text semantics extraction and representation. A semantic information extraction framework (Graphia) is implemented and evaluated over the Wikipedia corpus.

Transcript of A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

André Freitas, Danilo Carvalho, J. C. P. da Silva, Sean O’Riain, Edward Curry

Outline

Motivation
Representation
  Requirements
  Semantic Best-effort Representation
Extraction
  Graphia Extractor
  Preliminary Evaluation
  Extraction Examples
Conclusion

Motivation

Linked Data
  Terminological and structural regularity
  Shared semantic agreement between data consumers

Natural language texts
  No terminological or structural regularity
  Highly contextualized
  Complex semantic dependency relations
  Ambiguity
  Information selection/normalization

- vocabulary constraints + entity-centric + pay-as-you-go data semantics = semantic best-effort

Motivation

Vocabulary-independent (schema-free) queries
  How to abstract users from knowing the data representation?
  Semantic matching

Schemaless databases, in the limit, demand vocabulary independence

How is information extraction reshaped in this scenario?

Motivational Scenario

What is the relationship between Barack Obama and Indonesia?

Sentence: From age six to ten, Obama attended local schools in Jakarta, including Besuki Public School and St Francis Assisi School.

Semantic Best-effort Extraction

Entity-centric text representation
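
As a rough illustration of why an entity-centric representation helps with the question above, the sketch below stores the sentence as a small graph keyed by entity and searches for a path between two entities. The graph layout, the `find_path` helper, and the predicate labels are hypothetical, not the Graphia output format. The edge from Jakarta to Indonesia is an assumed background fact that is not stated in the sentence and would have to come from another source.

```python
from collections import deque

# Entity-centric graph: each entity maps to its outgoing (predicate, object) edges.
# The edges under "Barack Obama" and "local schools" paraphrase the example
# sentence. The Jakarta -> Indonesia edge is an ASSUMED background fact.
GRAPH = {
    "Barack Obama": [("attended", "local schools")],
    "local schools": [
        ("in", "Jakarta"),
        ("including", "Besuki Public School"),
        ("including", "St Francis Assisi School"),
    ],
    "Jakarta": [("capital of", "Indonesia")],  # assumption, not in the sentence
}


def find_path(graph, start, goal):
    """Breadth-first search returning the list of triples linking start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, obj in graph.get(node, []):
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, path + [(node, predicate, obj)]))
    return None  # no connection found


relationship = find_path(GRAPH, "Barack Obama", "Indonesia")
for triple in relationship:
    print(triple)
```

The returned path is the chain of triples connecting the two entities, which is one simple way to read "relationship" in a best-effort setting.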

Representation

Computational Linguistics Perspective

What is already there to represent NL?
  Discourse Representation Theory (DRT)
  Semantic Role Labeling (SRL)

Discourse Representation Theory (DRT)

“The key idea behind (...) Discourse Representation Theory is that each new sentence of a discourse is interpreted in the context provided by the sentences preceding it.” (van Eijck and Kamp)

Models propositions in discourse (multiple sentences).
Discourse representation structures (DRS).

John enters a card. Every card is green.
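
A minimal sketch of how the two example sentences could be written as discourse representation structures, assuming the usual textbook encoding: a DRS is a set of discourse referents plus conditions, and a universal statement becomes an implication between two sub-DRSs. The class names `DRS` and `Implies` and the tuple encoding of conditions are illustrative choices, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DRS:
    """A discourse representation structure: referents plus conditions."""
    referents: list = field(default_factory=list)
    conditions: list = field(default_factory=list)

    def merge(self, other):
        # Interpreting a new sentence in the context of the previous ones
        # amounts to merging its DRS into the DRS built so far.
        return DRS(self.referents + other.referents,
                   self.conditions + other.conditions)


@dataclass
class Implies:
    """A complex condition: antecedent DRS => consequent DRS."""
    antecedent: DRS
    consequent: DRS


# "John enters a card."  ->  [x, y | John(x), card(y), enter(x, y)]
s1 = DRS(["x", "y"], [("John", "x"), ("card", "y"), ("enter", "x", "y")])

# "Every card is green."  ->  [ | [z | card(z)] => [ | green(z)] ]
s2 = DRS([], [Implies(DRS(["z"], [("card", "z")]),
                      DRS([], [("green", "z")]))])

discourse = s1.merge(s2)
print(discourse.referents)
print(len(discourse.conditions))
```

The merged structure keeps the referents of the first sentence available when the second one is interpreted, which is the context dependence the quotation describes.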

Semantic Role Labeling (SRL)

Shallow semantic parsing.
Detection of arguments associated with a predicate.
Association of semantic types with arguments.

Bill cut his hair with a razor

[Agent Bill] cut [Patient his hair] [Instrument with a razor].
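
To make the bracketed annotation concrete, here is a small sketch that reads the `[Role text]` notation into a predicate plus a role-to-argument mapping. It only parses an already labeled string and does not perform role labeling itself. The `parse_srl` function and its output layout are hypothetical.

```python
import re

# Matches one bracketed argument such as "[Agent Bill]".
ROLE_PATTERN = re.compile(r"\[(\w+) ([^\]]+)\]")


def parse_srl(annotated):
    """Split a bracket-annotated sentence into its predicate and role fillers."""
    roles = {role: text for role, text in ROLE_PATTERN.findall(annotated)}
    # Whatever remains outside the brackets is taken to be the predicate.
    remainder = ROLE_PATTERN.sub(" ", annotated).strip(" .")
    predicate = " ".join(remainder.split())
    return {"predicate": predicate, "roles": roles}


frame = parse_srl("[Agent Bill] cut [Patient his hair] [Instrument with a razor].")
print(frame)
```

Each argument ends up attached to the predicate under its semantic type, which is the structure that SRL output provides to later extraction steps.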

Semantic Best-Effort

Objectives:
  Entity-centric & standardized: easier to integrate with other resources
  Remove the formal constraints and the ‘baggage’ from existing approaches
  Representation robust to extraction limitations/errors

Semantic Best-Effort Requirements

Text segmentation into (s,p,o)s
Context representation
Conceptual model independency
Resolve co-references (pay-as-you-go)
Represent recurrent discourse structures
Standardized representation (RDF(S))
Principled interpretation (compositionality)
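
Two of the requirements above, segmentation into (s,p,o) triples and context representation, can be sketched together with standard RDF reification: the triple becomes an `rdf:Statement` node, and context such as place and time is attached to that node. The `http://example.org/` namespace, the `place` and `time` property names, and the helper functions are hypothetical, and the segmentation of the Obama sentence is done by hand rather than by an extractor.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for illustration


def uri(namespace, local_name):
    """Format a URI reference in N-Triples syntax."""
    return f"<{namespace}{local_name}>"


def reify(stmt_id, s, p, o, context):
    """Describe the statement (s, p, o) as a node and attach context to it."""
    node = uri(EX, stmt_id)
    triples = [
        (node, uri(RDF, "type"), uri(RDF, "Statement")),
        (node, uri(RDF, "subject"), uri(EX, s)),
        (node, uri(RDF, "predicate"), uri(EX, p)),
        (node, uri(RDF, "object"), uri(EX, o)),
    ]
    # Context elements are plain literals here (no escaping is attempted).
    for prop, value in context.items():
        triples.append((node, uri(EX, prop), f'"{value}"'))
    return triples


def to_ntriples(triples):
    """Serialize triples one per line, N-Triples style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = reify(
    "stmt1", "Obama", "attended", "local_schools",
    {"place": "Jakarta", "time": "from age six to ten"},
)
print(to_ntriples(triples))
```

Keeping the context on the statement node rather than on the entities leaves the core triple intact, so a consumer that ignores context still gets a usable (s,p,o).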

Examples

- Text segmentation into (s,p,o)s
- Context representation
- Resolve co-references (pay-as-you-go)
- Conceptual model independency

Examples

- Context representation

Examples

- Represent recurrent discourse structures

Examples

- Represent recurrent discourse structures

- Resolve co-references (pay-as-you-go)

Examples

- Represent recurrent discourse structures

SDG Elements

Named, non-named entities and properties
Quantifiers & operators
Triple Trees
Context elements
Co-Referential elements
Resolved & normalized entities

Graph Patterns

[[Interpretation]]

Graph traversal – deref sequence

Extraction

SBE Graph Extraction Tool

Extraction Pipeline Architecture

Subject
Predicate
Object
Prepositional phrase & Noun complement
Reification
Time

Preliminary Evaluation

1033 relations (triples) extracted from 150 sentences in 5 randomly selected Wikipedia articles

The extracted graphs were manually classified by error category and accuracy.

Other Extraction Examples

Conclusion

Main direction for improvement is completeness
  Aligned with the pay-as-you-go scenario
Still need to define clear criteria for what you can’t extract
There is still a long way to go (e.g. complex subordination)
Investigation using existing n-ary relations patterns
Context (reification) should be a first-class citizen in the representation of natural language
Focus on getting the semantic pivots (rigid designators) right
Worth putting effort on enumerable patterns (timestamps, operators)