Learning to Link with Wikipedia

David Milne and Ian H. Witten
Department of Computer Science, University of Waikato

CIKM 2008 (Best Paper Award)

Presented by Dongjoo Lee, IDS Lab., CSE, SNU


Copyright 2009 by CEBT

Introduction

Wikification

Find the significant topics in a document and link them to the corresponding Wikipedia articles.

IDS Lab. 2009 Spring Seminar

Related Work

Linking without restricting the destination documents: Smart-Tag Service (Microsoft), AutoLink (Google)

– Many users were concerned that pages were being “surreptitiously” modified for commercial purposes

– Automatic linking is most successful when restricted to safe domains such as cinema (Drenner et al. 2006)

Using Wikipedia as the destination for links: Wikify (Mihalcea and Csomai, 2007)

– Detection involves identifying the terms and phrases from which links should be made.

– Disambiguation ensures that the detected phrases link to the appropriate article.

Topic indexing: identifying the most significant topics, i.e. those which the document was written about (Maron, 1977; Medelyan et al., 2008)

Learning to Link with Wikipedia

Learning to disambiguate links

Learning to detect links

Wikification in the wild

Examples and implications

Conclusions

Learning to disambiguate links - commonness

Disambiguation balances the commonness of a sense with its relatedness to the surrounding context

Commonness (prior probability): the number of times a Wikipedia article is used as the link destination for a term in Wikipedia
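A minimal sketch of commonness as a prior probability, assuming the number of links from a given anchor term to each candidate article has already been counted. The `commonness` helper, the normalization by the anchor's total link count, and the counts themselves are illustrative assumptions, not figures from the paper.

```python
from collections import Counter


def commonness(anchor_counts: Counter) -> dict:
    """Return, for each candidate sense, its share of the anchor's links."""
    total = sum(anchor_counts.values())
    if total == 0:
        return {}
    return {sense: count / total for sense, count in anchor_counts.items()}


# Hypothetical link counts for the anchor "tree" (made-up numbers).
tree_links = Counter({
    "Tree": 930,
    "Tree (data structure)": 50,
    "Tree (graph theory)": 20,
})
priors = commonness(tree_links)
```

With these made-up counts the plant sense dominates, which is why commonness alone is not enough and has to be balanced against the context.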

Learning to disambiguate links - relatedness

Compare each possible sense with its surrounding context

The words that make up the context may themselves be ambiguous

Use only unambiguous context terms, i.e. those with a single sense

– e.g. algorithm, uninformed search, LIFO stack

Disambiguation then reduces to selecting the sense article that has the most in common with all of the context articles

$$
\mathrm{relatedness}(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))}
$$

a, b: the articles of interest

A, B: the sets of all articles that link to a and b, respectively

W: the set of all articles in Wikipedia

Some context terms are better than others
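The relatedness formula on this slide can be evaluated directly from the two in-link sets. The sketch below is one way to do so, assuming the sets fit in memory; the function name, the handling of empty overlap (returning infinity) and the toy article IDs are assumptions made for illustration. Note that, as the formula is written, identical link sets give 0 and the value grows as the overlap shrinks.

```python
import math


def relatedness(A: set, B: set, total_articles: int) -> float:
    """Evaluate the slide's formula for in-link sets A and B, with |W| = total_articles."""
    overlap = len(A & B)
    if overlap == 0:
        # log(0) is undefined; treat articles with no shared in-links as maximally distant.
        return math.inf
    larger = max(len(A), len(B))
    smaller = min(len(A), len(B))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        # Only possible when both sets cover all of W, so the numerator is 0 as well.
        return 0.0
    return (math.log(larger) - math.log(overlap)) / denominator


# Toy in-link sets identified by article ID (made-up data).
links_to_a = {1, 2, 3, 4}
links_to_b = {3, 4, 5}
score = relatedness(links_to_a, links_to_b, total_articles=1000)
```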

Training – Configuration – Test

[Figure: the data is divided into a Training Set (500), a Configuration Set (500) and a Test Set (100), one for each of the Training, Configuration and Test phases]

Configuration: find an optimal classifier and variables

Evaluation: precision, recall, f-measure
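The three evaluation measures named on this slide can be computed by comparing predicted links against the links in the original articles. This is a generic sketch of those measures, not the authors' evaluation code; the function name and the example link sets are hypothetical.

```python
def precision_recall_f(predicted: set, gold: set) -> tuple:
    """Return (precision, recall, f-measure) for predicted links against gold links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical link targets for one test document.
predicted_links = {"Algorithm", "Tree (data structure)", "Stack", "Forest"}
gold_links = {"Algorithm", "Tree (data structure)", "Depth-first search"}
p, r, f = precision_recall_f(predicted_links, gold_links)
```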

Learning to disambiguate links – configuration and attribute selection

Identifying the most suitable classification algorithm

Setting the minimum probability a sense must have to be considered by the algorithm

– Discarding unlikely senses reduces the time required to compare relatedness between the context and the candidate senses
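The effect of a minimum sense probability can be sketched as a simple filter applied before any relatedness comparison. The helper name, the prior values and the threshold below are all hypothetical; the slide does not state which value was chosen.

```python
def prune_senses(priors: dict, min_probability: float) -> dict:
    """Keep only the senses whose prior probability reaches the minimum."""
    return {sense: p for sense, p in priors.items() if p >= min_probability}


# Made-up prior probabilities for the senses of one ambiguous term.
sense_priors = {
    "Tree": 0.93,
    "Tree (data structure)": 0.05,
    "Tree (graph theory)": 0.02,
}
candidates = prune_senses(sense_priors, min_probability=0.03)
```

Every sense dropped here is one fewer relatedness comparison against each context article.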

Learning to disambiguate links - evaluation

Learning to detect links

Naïve approach (Mihalcea and Csomai, 2007)

If the probability that a word or phrase is used as a link in Wikipedia exceeds a certain threshold, a link is attached to it
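The naïve approach can be sketched as a threshold on link probability, taken here to be the number of articles in which a phrase appears as a link divided by the number of articles in which it appears at all. The function names, the counts and the threshold are illustrative assumptions.

```python
def link_probability(linked_in: int, mentioned_in: int) -> float:
    """Fraction of the articles mentioning a phrase that also use it as a link."""
    return linked_in / mentioned_in if mentioned_in else 0.0


def detect_links(phrase_counts: dict, threshold: float) -> list:
    """Return the phrases whose link probability exceeds the threshold."""
    return [
        phrase
        for phrase, (linked_in, mentioned_in) in phrase_counts.items()
        if link_probability(linked_in, mentioned_in) > threshold
    ]


# Made-up (linked_in, mentioned_in) article counts per phrase.
counts = {
    "depth-first search": (40, 50),
    "tree": (900, 9000),
    "the": (5, 100000),
}
linked_phrases = detect_links(counts, threshold=0.5)
```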

Presented approach

A machine-learned link detector that uses several features

– Link probability

– Relatedness

– Disambiguation confidence

– Generality: the minimum depth at which the topic is located in Wikipedia’s category tree

– Location and spread: first occurrence, last occurrence, and spread (the distance between them)
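The location and spread features are straightforward to derive from the positions at which a topic is mentioned. The sketch below assumes positions are word offsets and normalizes them by document length so that documents of different sizes are comparable; the function name, the dictionary keys and the normalization are assumptions for illustration.

```python
def location_features(positions: list, doc_length: int) -> dict:
    """Compute first occurrence, last occurrence and spread as fractions of the document."""
    if not positions or doc_length <= 0:
        raise ValueError("need at least one position and a positive document length")
    first = min(positions) / doc_length
    last = max(positions) / doc_length
    return {
        "first_occurrence": first,
        "last_occurrence": last,
        "spread": last - first,
    }


# A topic mentioned at word offsets 10, 40 and 90 of a 100-word document.
features = location_features([10, 40, 90], doc_length=100)
```

A feature vector for the link detector would combine these values with link probability, relatedness, disambiguation confidence and generality.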

Learning to detect links (cont’d)

Learning to detect links - training, configuration, and evaluation

Wikification in the wild

Experimental data

A subset of 50 documents from the AQUAINT corpus

Participants and tasks

Mechanical Turk (Barr and Cabrera, 2006)

– A crowdsourcing service hosted by Amazon that provides a way for human judgment to be easily incorporated into software applications

Results

Examples and implications

Conclusion

The present paper’s contribution is a proven method of extracting key concepts from plain text that has been evaluated against an extensive body of human performance

Discussion

Well written

Clear motivation and contribution

Clear presentation of the method used to accomplish the goal

But few new ideas

– A combination of existing features that are frequently used for text classification and similar tasks