TALP EL System in TAC-KBP 2013



RESULTS

REFERENCES

S. Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague.

C. Fellbaum. 1998. WordNet: An electronic lexical database. MIT Press.

J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. 2011. Robust disambiguation of named entities in text. In the Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, Scotland.

J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. In Artificial Intelligence Journal.

L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.

ACKNOWLEDGMENTS

This work has been produced with the support of the project SKATeR (TIN2012-38584-C06-01).

Tasks:

- Reduce the ambiguity of the mention by expanding the query from its context.
- Enrich the background document by integrating information retrieved from knowledge resources.

1-1) Query Classification

1-2) Background Document Enrichment

1-3) Alternate Name Generation

TALP Research Center, UPC, Spain.

A. Naderi, H. Rodríguez, and J. Turmo

TALP EL System in TAC-KBP 2013

ABSTRACT

This poster presents our Entity Linking (EL) system, which uses a topic modeling approach and takes advantage of a huge Wikipedia-based knowledge resource to enrich background documents with relevant information in order to increase linking accuracy.

{anaderi, horacio, turmo}@lsi.upc.edu

The VSM components for ranking candidates are extracted from the background document of each query. Thus, as many disambiguated entities as possible are required. To obtain them, the AIDA system (Hoffart et al., 2011) is applied. AIDA is a framework for entity detection and disambiguation. Given a natural-language text or a Web table, it maps mentions of ambiguous names onto canonical entities registered in YAGO2 (Hoffart et al., 2013).

YAGO2 is a huge semantic KB derived from Wikipedia (WP), WordNet (Fellbaum, 1998) and GeoNames, containing more than 10 million entities and more than 120 million facts about these entities. Each entity in YAGO2 carries several kinds of information, including weighted keyphrases.

A keyphrase is contextual information extracted from the link anchor, in-link, title and WP category sources of the corresponding entity page that can be used for entity disambiguation. We use AIDA to extract keyphrases from the entities in the background document.

Figure 2 shows an example of producing related keyphrases for the background document mentions “Man U”, “Liverpool”, and “Premier league” using AIDA for the query name “Scholes”.

The detailed architecture of the system, with the sample query “Scholes”, is depicted in Figure 4.

Task: This module sorts the retrieved candidates according to their likelihood of being the correct referent.

Our ranking approach is a Vector Space Model (VSM) inspired by Cucerzan (2007). In our case, the vector space consists of the whole set of words within the keyphrases found in the enriched background document, and each word is weighted by its Tf-Idf computed against the set of candidate documents. We use cosine similarity as the similarity measure. In addition, in order to reduce dimensionality, we apply Latent Semantic Indexing (LSI).

A term clustering method is applied to cluster NIL queries.
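The ranking step described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the tokenizer, the smoothed Idf formula, the function name `rank_candidates`, and the toy candidate texts are all assumptions made for illustration, and the LSI dimensionality reduction is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def rank_candidates(keyphrases, candidate_docs):
    """Rank candidates by cosine similarity to the keyphrase vector.

    keyphrases: list of keyphrase strings from the enriched background document.
    candidate_docs: dict mapping candidate name -> candidate text.
    Returns a list of (name, score) pairs, best first.
    """
    # The vector space is the set of words occurring in the keyphrases.
    vocab = sorted({w for kp in keyphrases for w in tokenize(kp)})
    doc_counts = {name: Counter(tokenize(text)) for name, text in candidate_docs.items()}
    n_docs = len(doc_counts)

    # Idf is computed against the candidate documents (smoothed; an assumption).
    idf = {}
    for w in vocab:
        df = sum(1 for counts in doc_counts.values() if w in counts)
        idf[w] = math.log((1 + n_docs) / (1 + df)) + 1.0

    def vector(counts):
        return [counts[w] * idf[w] for w in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    query_vec = vector(Counter(w for kp in keyphrases for w in tokenize(kp)))
    scores = {name: cosine(query_vec, vector(c)) for name, c in doc_counts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, invented for illustration only.
keyphrases = ["manchester united", "premier league", "england midfielder"]
candidates = {
    "Paul Scholes": "england midfielder who played for manchester united in the premier league",
    "Scholes (village)": "a village in england",
}
ranking = rank_candidates(keyphrases, candidates)
```

In this toy run the footballer's page shares every keyphrase word with the query vector, so it ranks above the village page, which only shares one word.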

Universitat Politècnica de Catalunya (BarcelonaTech)

Fig. 1: General architecture of EL systems

Fig. 2: Enriching the background document of the query “Scholes” to generate keyphrases using the AIDA system

Fig. 3: Sample background document from the TAC-KBP data set

Fig. 4: Detailed architecture of our EL system with a sample query “Scholes”


Fig. 5: A KB candidate entity page for the query “Scholes”, containing a set of facts and its informative context

In this step, a set of Alternate Names (ANs) for each query is generated from the content of its corresponding background document. In Figure 3, the system uses acronym expansion to extract “Football Association” from “FA”.

In addition, several auxiliary gazetteers are applied, such as:

- US states (e.g., the pair <CA, California>).
- Country abbreviations (e.g., the pair <UK, United Kingdom>).

Thus, a set of potential candidates is generated from each AN of each query.
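A rough sketch of how alternate names might be produced is given below. The poster does not describe the acronym expansion rule, so the initial-letter matching used here is an assumption, as are the function names and the tiny gazetteers, which contain only the two pairs quoted above.

```python
import re

# Hypothetical gazetteers holding only the example pairs from the text.
US_STATES = {"CA": "California"}
COUNTRY_ABBREVIATIONS = {"UK": "United Kingdom"}


def expand_acronym(acronym, text):
    """Return capitalized word sequences in `text` whose initials spell `acronym`."""
    n = len(acronym)
    words = re.findall(r"[A-Za-z]+", text)
    expansions = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        initials = "".join(w[0] for w in window)
        if initials == acronym and all(w[0].isupper() for w in window):
            phrase = " ".join(window)
            if phrase not in expansions:
                expansions.append(phrase)
    return expansions


def alternate_names(query, background_text):
    """Collect the query name, acronym expansions, and gazetteer entries."""
    names = [query]
    if query.isupper():
        # Treat all-uppercase queries as acronyms (an assumption).
        names.extend(expand_acronym(query, background_text))
    for gazetteer in (US_STATES, COUNTRY_ABBREVIATIONS):
        if query in gazetteer:
            names.append(gazetteer[query])
    return names


fa_names = alternate_names("FA", "The Football Association (FA) fined the club.")
```

Each name returned this way would then be passed on to candidate generation.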

Task: Given a particular query q, a set of candidates C is found by retrieving those entries from the KB whose names are similar enough, according to the Dice coefficient, to one of the alternate names of q found by query expansion.
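The retrieval criterion can be sketched as follows. The poster does not state which units the Dice coefficient is computed over or which threshold counts as "similar enough", so the character bigrams and the `threshold=0.5` default below are assumptions, as are the function names.

```python
def bigrams(name):
    # Assumption: Dice is computed over lowercase character bigrams.
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over the bigram sets of a and b."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Both names are shorter than two characters; fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def generate_candidates(alternate_names, kb_names, threshold=0.5):
    """Keep KB entry names that are similar enough to at least one alternate name."""
    return [
        kb_name
        for kb_name in kb_names
        if any(dice(an, kb_name) >= threshold for an in alternate_names)
    ]


candidates = generate_candidates(
    ["Scholes"], ["Paul Scholes", "Manchester United", "Scholes"]
)
```

With these assumptions, "Scholes" retrieves both "Paul Scholes" and "Scholes" but not "Manchester United", whose bigram overlap with the query is too small.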

In general, KB entity pages contain facts and an informative context about the entity. We enrich the context information of each KB candidate entity by looking up the entities named in its facts as separate entries in the reference KB and then merging their informative contexts with the current one. With this technique, the context of each candidate can become more discriminative and informative.

Figure 5 shows a sample KB entity page corresponding to the entity name “Paul Scholes”. The system collects the <wiki_text> information of its related entities “Manchester United” and “England” to enrich the <wiki_text> of “Paul Scholes”.
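The merging step can be pictured with a small sketch. The in-memory dictionary layout, the fact names, the placeholder texts, and the function `enrich_context` are all hypothetical; the real reference KB stores entity pages in its own format.

```python
def enrich_context(entity_name, kb):
    """Append the wiki_text of every fact value that is itself a KB entity."""
    entry = kb[entity_name]
    parts = [entry["wiki_text"]]
    for value in entry["facts"].values():
        related = kb.get(value)
        if related is not None and value != entity_name:
            parts.append(related["wiki_text"])
    return " ".join(parts)


# Toy KB with placeholder texts, invented for illustration only.
toy_kb = {
    "Paul Scholes": {
        "facts": {"club": "Manchester United", "country": "England", "position": "Midfielder"},
        "wiki_text": "text about Paul Scholes",
    },
    "Manchester United": {"facts": {}, "wiki_text": "text about Manchester United"},
    "England": {"facts": {}, "wiki_text": "text about England"},
}

enriched = enrich_context("Paul Scholes", toy_kb)
```

Fact values with no entry of their own in the KB (here "Midfielder") contribute nothing, so only the contexts of "Manchester United" and "England" are merged in.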

Table 1: The TALP official EL results (B-cubed+ F1) in TAC-KBP 2013

All Docs
          All Entities   PER     ORG     GPE
Overall   0.435          0.535   0.538   0.248
In-KB     0.285          0.333   0.320   0.242
NIL       0.584          0.736   0.607   0.248

1. QUERY EXPANSION AND ENRICHMENT

2. CANDIDATE GENERATION

3. CANDIDATE RANKING AND NIL CLUSTERING

GENERAL ARCHITECTURE

As shown in Figure 1, our EL approach follows the typical state-of-the-art architecture, which includes the following steps:

1. Query Expansion and Enrichment

2. Candidate Generation

3. Candidate Ranking and NIL Clustering

The system classifies queries into three entity types (PER, ORG, GPE) using the Illinois NERC (Ratinov and Roth, 2009). It first classifies all entity mentions in the background document. Among all mentions with their types, those related to the query name are selected. The system then chooses the longest mention (e.g., selecting the full name of the Manchester United footballer, “Paul Aaron Scholes”, rather than a part of his name, “P. Scholes”, for the query name “Scholes”) and assigns its type as the query type.
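The selection rule above can be sketched in a few lines. The input is assumed to be a list of (surface form, type) pairs already produced by a NERC tagger, and "related to the query name" is approximated here by case-insensitive substring containment; both are assumptions, as is the function name.

```python
def query_type(query_name, mentions):
    """Return the type of the longest tagged mention that contains the query name.

    mentions: list of (surface_form, entity_type) pairs from a NERC tagger.
    Returns None when no mention is related to the query name.
    """
    related = [
        (surface, etype)
        for surface, etype in mentions
        if query_name.lower() in surface.lower()
    ]
    if not related:
        return None
    # Longest mention by character length wins.
    _, etype = max(related, key=lambda mention: len(mention[0]))
    return etype


tagged = [
    ("P. Scholes", "PER"),
    ("Paul Aaron Scholes", "PER"),
    ("Manchester United", "ORG"),
]
scholes_type = query_type("Scholes", tagged)
```

Here the longest related mention is "Paul Aaron Scholes", so the query "Scholes" is typed as PER.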