Entity Linking - Indian Institute of Technology...

Post on 02-Oct-2020

7 views 0 download

Transcript of Entity Linking - Indian Institute of Technology...

Entity Linking

Pawan Goyal

CSE, IIT Kharagpur

August 17th, 2016

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 1 / 23

What is Entity Linking?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 2 / 23

What is Entity Linking?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 3 / 23

What is Entity Linking?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 4 / 23

Entity Linking: Common Steps

Determine “linkable” phrases

mention detection - MD

Rank/Select candidate entity links

link generation - LG

Use “context” to disambiguate/filter/improve

disambiguation - DA

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 5 / 23

Entity Linking: Common Steps

Determine “linkable” phrases

mention detection - MD

Rank/Select candidate entity links

link generation - LG

Use “context” to disambiguate/filter/improve

disambiguation - DA

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 5 / 23

Entity Linking: Common Steps

Determine “linkable” phrases

mention detection - MD

Rank/Select candidate entity links

link generation - LG

Use “context” to disambiguate/filter/improve

disambiguation - DA

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 5 / 23

Mention Detection (MD)

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 6 / 23

Link Generation (LG)

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 7 / 23

Disambiguation (DA)

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 8 / 23

Preliminaries: Wikipedia

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 9 / 23

Preliminaries: Disambiguation Pages

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 10 / 23

Some Statistics

WordNet80k entity definitions

142k senses (entity - surface forms)

Wikipedia4M entity definitions

24M senses

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 11 / 23

Some Statistics

WordNet80k entity definitions

142k senses (entity - surface forms)

Wikipedia4M entity definitions

24M senses

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 11 / 23

Wikipedia based methods

What can be a good measure for MD?

keyphraseness(w)Number of Wikipedia articles that use it as an anchor, divided by the numer ofarticles that mention it at all.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 12 / 23

Wikipedia based methods

What can be a good measure for MD?

keyphraseness(w)Number of Wikipedia articles that use it as an anchor, divided by the numer ofarticles that mention it at all.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 12 / 23

Wikipedia based methods

What can be a good measure for MD?

keyphraseness(w)Number of Wikipedia articles that use it as an anchor, divided by the numer ofarticles that mention it at all.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 12 / 23

Wikipedia based methods

What can be a good measure for DA?

commonness(w,c)The fraction of times, a particular sense is used as a destination in Wikipedia.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 13 / 23

Wikipedia based methods

What can be a good measure for DA?

commonness(w,c)The fraction of times, a particular sense is used as a destination in Wikipedia.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 13 / 23

Commonness and keyphraseness

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 14 / 23

Always the best decision?

Using Relatedness: Basic IdeaIn a sufficiently long text, one finds terms that do not requiredisambiguation at all.

Use every unambiguous link in the document as context to disambiguateambiguous ones.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 15 / 23

Always the best decision?

Using Relatedness: Basic IdeaIn a sufficiently long text, one finds terms that do not requiredisambiguation at all.

Use every unambiguous link in the document as context to disambiguateambiguous ones.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 15 / 23

Finding Relatedness: A link-based measure

relatedness(c,c′)Using the intersection among incoming as well as outgoing links of twoWikipedia pages

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 16 / 23

Finding Relatedness: A link-based measure

relatedness(c,c′)Using the intersection among incoming as well as outgoing links of twoWikipedia pages

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 16 / 23

Computing Relatedness

Each candidate sense and context term is represented by a singleWikipedia article.

Thus the problem is reduced to selecting the sense article that has mostin common with all of the context articles.

Comparison of articles is facilitated by the Wikipedia Link-basedmeasure, which measures the semantic similarity of two Wikipedia pagesby comparing their incoming and outgoing links.

The relatedness of a candidate sense is the weighted average of itsrelatedness to each context article.

How to give different weights to the context terms?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 17 / 23

Computing Relatedness

Each candidate sense and context term is represented by a singleWikipedia article.

Thus the problem is reduced to selecting the sense article that has mostin common with all of the context articles.

Comparison of articles is facilitated by the Wikipedia Link-basedmeasure, which measures the semantic similarity of two Wikipedia pagesby comparing their incoming and outgoing links.

The relatedness of a candidate sense is the weighted average of itsrelatedness to each context article.

How to give different weights to the context terms?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 17 / 23

Computing Relatedness

Each candidate sense and context term is represented by a singleWikipedia article.

Thus the problem is reduced to selecting the sense article that has mostin common with all of the context articles.

Comparison of articles is facilitated by the Wikipedia Link-basedmeasure, which measures the semantic similarity of two Wikipedia pagesby comparing their incoming and outgoing links.

The relatedness of a candidate sense is the weighted average of itsrelatedness to each context article.

How to give different weights to the context terms?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 17 / 23

Computing Relatedness

Each candidate sense and context term is represented by a singleWikipedia article.

Thus the problem is reduced to selecting the sense article that has mostin common with all of the context articles.

Comparison of articles is facilitated by the Wikipedia Link-basedmeasure, which measures the semantic similarity of two Wikipedia pagesby comparing their incoming and outgoing links.

The relatedness of a candidate sense is the weighted average of itsrelatedness to each context article.

How to give different weights to the context terms?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 17 / 23

Computing Relatedness

Each candidate sense and context term is represented by a singleWikipedia article.

Thus the problem is reduced to selecting the sense article that has mostin common with all of the context articles.

Comparison of articles is facilitated by the Wikipedia Link-basedmeasure, which measures the semantic similarity of two Wikipedia pagesby comparing their incoming and outgoing links.

The relatedness of a candidate sense is the weighted average of itsrelatedness to each context article.

How to give different weights to the context terms?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 17 / 23

Weighting the Context Terms

link probability: Use the ones that are almost always used as a linkwithin the articles where they are found, and always link to the samedestination

relatedness: We can determine how closely a term relates to the centraldocument by calculating its average semantic relatedness to all othercontext terms

These two variables - link probability and relatedness - are averaged toprovide a weight for each context.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 18 / 23

Weighting the Context Terms

link probability: Use the ones that are almost always used as a linkwithin the articles where they are found, and always link to the samedestination

relatedness: We can determine how closely a term relates to the centraldocument by calculating its average semantic relatedness to all othercontext terms

These two variables - link probability and relatedness - are averaged toprovide a weight for each context.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 18 / 23

Weighting the Context Terms

link probability: Use the ones that are almost always used as a linkwithin the articles where they are found, and always link to the samedestination

relatedness: We can determine how closely a term relates to the centraldocument by calculating its average semantic relatedness to all othercontext terms

These two variables - link probability and relatedness - are averaged toprovide a weight for each context.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 18 / 23

Weighting the Context Terms

link probability: Use the ones that are almost always used as a linkwithin the articles where they are found, and always link to the samedestination

relatedness: We can determine how closely a term relates to the centraldocument by calculating its average semantic relatedness to all othercontext terms

These two variables - link probability and relatedness - are averaged toprovide a weight for each context.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 18 / 23

Can we improve mention detection with this approach?

The link detection process starts by gathering all n-grams in thedocument, and retaining those whose probability exceeds a very lowthreshold.

Is it the best method?

All the remaining phrases are disambiguated using the approachmentioned earlier.

This results in a set of associations between terms in the document andthe Wikipedia articles that describe them.

Can you use this to learn – which concepts should be linked?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 19 / 23

Can we improve mention detection with this approach?

The link detection process starts by gathering all n-grams in thedocument, and retaining those whose probability exceeds a very lowthreshold. Is it the best method?

All the remaining phrases are disambiguated using the approachmentioned earlier.

This results in a set of associations between terms in the document andthe Wikipedia articles that describe them.

Can you use this to learn – which concepts should be linked?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 19 / 23

Can we improve mention detection with this approach?

The link detection process starts by gathering all n-grams in thedocument, and retaining those whose probability exceeds a very lowthreshold. Is it the best method?

All the remaining phrases are disambiguated using the approachmentioned earlier.

This results in a set of associations between terms in the document andthe Wikipedia articles that describe them.

Can you use this to learn – which concepts should be linked?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 19 / 23

Can we improve mention detection with this approach?

The link detection process starts by gathering all n-grams in thedocument, and retaining those whose probability exceeds a very lowthreshold. Is it the best method?

All the remaining phrases are disambiguated using the approachmentioned earlier.

This results in a set of associations between terms in the document andthe Wikipedia articles that describe them.

Can you use this to learn – which concepts should be linked?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 19 / 23

Can we improve mention detection with this approach?

The link detection process starts by gathering all n-grams in thedocument, and retaining those whose probability exceeds a very lowthreshold. Is it the best method?

All the remaining phrases are disambiguated using the approachmentioned earlier.

This results in a set of associations between terms in the document andthe Wikipedia articles that describe them.

Can you use this to learn – which concepts should be linked?

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 19 / 23

Example

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 20 / 23

The Learning Problem: Which topics should be linked?

The automatically identified Wikipedia articles provide training instancesfor a classifier.

Positive examples are the articles that were manually linked to, whilenegative ones are those that were not.

Features of these articles – and the places where they were mentioned –are used to inform the classifier about which topics should and should notbe linked.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 21 / 23

The Learning Problem: Which topics should be linked?

The automatically identified Wikipedia articles provide training instancesfor a classifier.

Positive examples are the articles that were manually linked to, whilenegative ones are those that were not.

Features of these articles – and the places where they were mentioned –are used to inform the classifier about which topics should and should notbe linked.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 21 / 23

The Learning Problem: Which topics should be linked?

The automatically identified Wikipedia articles provide training instancesfor a classifier.

Positive examples are the articles that were manually linked to, whilenegative ones are those that were not.

Features of these articles – and the places where they were mentioned –are used to inform the classifier about which topics should and should notbe linked.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 21 / 23

What are the features?

Link Probability: Average as well as maximum of link probability of thelink locations – (e.g. Hillary Clinton and Clinton)

Relatedness: Topics which relate to the central thread of the documentare more likely to be linked

Disambiguation Confidence: The confidence score of the classifier fordisambiguation

Generality: Defined as the minimum depth at which it is located inWikipedia’s category tree. More useful for the readers to provide links forspecific topics.

Location and Spread: Where are these mentioned? First occurrence,last occurrence and the spread.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 22 / 23

References

Mihalcea, Rada, and Andras Csomai. “Wikify!: linking documents toencyclopedic knowledge.” Proceedings of the sixteenth ACM conferenceon information and knowledge management. ACM, 2007.

Milne, David, and Ian H. Witten. “Learning to link with wikipedia.”Proceedings of the 17th ACM conference on Information and knowledgemanagement. ACM, 2008.

Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 23 / 23