Knowledge-based - Heidelberg University › colloquium › docs › milne_slides.pdfKnowledge-based...


Knowledge-based Information Retrieval with Wikipedia

David Milne | Ian H. Witten

The University of Waikato | New Zealand

Koru | Wikipedia Link-based Measure | Wikification


Limitations of search engines

“Search is not solved”

Current search engines:
don’t understand documents
don’t understand queries


Knowledge-based information retrieval

Consult an external knowledge base to find out what these characters mean and proactively do stuff with them

A fairly obvious, compelling idea
But one that hasn’t worked out

We haven’t had the right knowledge base
Computers aren’t accurate enough
Humans aren’t quick enough


Wikipedia | as a knowledge base

What topics/concepts are there? ~2 million articles and categories

How are topics referred to? ~5 million titles, redirects and anchors

How do topics relate to each other? ~60 million article and category links

[Figure: example topic graph around rugby. Nodes shown: football, team sports, ball sports, rugby, rugby league, touch rugby, rugby union, rugby world cup, RWC, New Zealand national rugby team, all blacks, Australia national rugby team, Wallabies]
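The three questions on this slide map naturally onto three lookup tables: a set of topics, a table from labels to topics, and a table of links between topics. The following is a minimal sketch under that assumption. The class `ToyKnowledgeBase`, its method names, and the sample links are hypothetical and chosen for illustration; the slides do not describe how the actual knowledge base is stored.

```python
from collections import defaultdict


class ToyKnowledgeBase:
    """Hypothetical in-memory store; not the structure used by the authors."""

    def __init__(self):
        self.topics = set()             # what topics are there?
        self.labels = defaultdict(set)  # how are topics referred to?
        self.links = defaultdict(set)   # how do topics relate to each other?

    def add_topic(self, topic, labels=()):
        """Register a topic under its own title and any alternative labels."""
        self.topics.add(topic)
        for label in (topic, *labels):
            self.labels[label.lower()].add(topic)

    def add_link(self, source, target):
        """Record a directed link from one topic to another."""
        self.links[source].add(target)

    def lookup(self, term):
        """Return the set of topics a surface term may refer to."""
        return set(self.labels.get(term.lower(), set()))


kb = ToyKnowledgeBase()
kb.add_topic("Rugby World Cup", labels=["RWC"])
kb.add_topic("New Zealand national rugby team", labels=["All Blacks"])
kb.add_topic("Australia national rugby team", labels=["Wallabies"])
# Illustrative links only; the figure does not specify link directions.
kb.add_link("Rugby World Cup", "New Zealand national rugby team")
kb.add_link("Rugby World Cup", "Australia national rugby team")

print(kb.lookup("rwc"))
```

Lower-casing labels on both insert and lookup is a simplification; a real system would need to treat case more carefully, since case can distinguish senses.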


Wikipedia | as a knowledge base

WordNet: 118,000 synsets
ResearchCyc: 300,000 concepts
Wikipedia: 2,000,000 articles

20 years vs. 7 years
$$$$$$$ vs. almost free
1 language vs. 250 languages


Wikipedia | as a knowledge base

[Figure: the topic George W. Bush, contrasting formal structure (◄) with Wikipedia (►). Items shown: George Walker Bush, George W. Bush, Dubya, Shrubya, Thief in chief; 2004 U.S. presidential election controversy and irregularities, Al-Qaeda, September 11, Iraq War; Presidents of the United States, Current national leaders, Heads of state]


My research goals

“Wikipedia will provide significantly improved retrieval, as it is”
We don’t need to make it “tidy”
It’s not a question of sophisticated NLP or AI
It’s more about HCI

So let’s make a search engine that consults Wikipedia, and find out!


Koru

[Diagram of Koru's components: Wikipedia, WikiSaurus, Documents, Document Topics, Queries, Query Topics, Related Topics]


Koru | interface


security AND
(air carrier OR airline company OR airline industry OR flight company OR modern aviation OR passenger aircraft …) AND
(America OR American OR American continent …)


Koru | Evaluation

Wikipedia matches query terminology extremely well

Recognition and expansion of topics improves retrieval

Recognition of topics modifies query behavior

Related topics need further investigation

Extraction of thesaurus terms is inaccurate

“rugby world cup” vs. (“rugby world cup” OR rwc OR “webb ellis cup”)

rugby world cup vs. “rugby world cup”
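The two comparisons above differ in whether a recognized topic is matched as a phrase and whether its alternative labels are OR-ed in. A minimal sketch of that expansion step follows. The helper `expand_query` and the hand-written synonym table are assumptions for illustration; the slides do not show how Koru builds its queries internally.

```python
def expand_query(topics, synonyms):
    """AND the recognized topics together, OR-ing each with its synonyms."""
    clauses = []
    for topic in topics:
        alternatives = [topic, *synonyms.get(topic, [])]
        # Quote multi-word labels so they are matched as phrases.
        quoted = [f'"{a}"' if " " in a else a for a in alternatives]
        if len(quoted) == 1:
            clauses.append(quoted[0])
        else:
            clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical synonym table, written by hand for this example.
synonyms = {"rugby world cup": ["rwc", "webb ellis cup"]}

print(expand_query(["rugby world cup"], synonyms))
```

Keeping the original label first in each OR group means the unexpanded phrase query is always a subset of the expanded one, which makes the two easy to compare in an evaluation.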


What now

We need to improve how topics and the relations between them are extracted

Semantic Relatedness
Wikification


Semantic Relatedness

Given any two terms, what is the strength of the semantic relation between them?

Highly useful: AI, data mining, IR, NLP

But subjective

Term 1        Term 2        Score
Love          Sex            6.77
Tiger         Cat            7.35
Tiger         Tiger         10.00
Book          Paper          7.46
Computer      Keyboard       7.62
Computer      Internet       7.58
Plane         Car            5.77
Stock         Life           0.92
…             …              …
Television    Radio          6.77


Semantic Relatedness | with Wikipedia

Two techniques have been developed already

WikiRelate!: 19% - 48%
Explicit Semantic Analysis: 75%

Scale and structure:
GBs of text
millions of articles
hundreds of thousands of categories


Semantic Relatedness | Wikipedia links

Wikipedia has an extremely rich hyperlink structure that has been ignored so far.

[Figure: Wikipedia articles linked to Automobile and Global Warming. Articles shown: Petrol Engine, Fossil Fuel, 20th Century, Emission Standard, Bicycle, Diesel Engine, Carbon Dioxide, Air Pollution, Greenhouse Gas, Alternative Fuel, Transport, Vehicle, Henry Ford, Combustion Engine, Kyoto Protocol, Ozone, Greenhouse Effect, Planet, Audi, Battery (electricity), Arctic Circle, Environmental Skepticism, Greenpeace, Ecology]
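One way to turn shared links into a relatedness score is to compare the sets of articles that link to each of the two topics. The sketch below assumes a normalized-distance formula of the kind described in the link-based measure paper listed in the references; the slide itself does not state the formula. The function name and the toy link sets are hypothetical: the article names come from the figure, but which article links to which is invented here.

```python
import math


def link_relatedness(links_a, links_b, total_articles):
    """Relatedness in [0, 1] computed from two sets of linking articles."""
    a, b = set(links_a), set(links_b)
    shared = a & b
    if not shared:
        return 0.0
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of both link sets")
    distance = (math.log(larger) - math.log(len(shared))) / (
        math.log(total_articles) - math.log(smaller)
    )
    # Clamp at zero, since the raw distance can exceed 1 for weak pairs.
    return max(0.0, 1.0 - distance)


# Invented link sets for illustration only.
automobile_in = {"Petrol Engine", "Fossil Fuel", "Henry Ford",
                 "Audi", "Vehicle", "Transport"}
global_warming_in = {"Fossil Fuel", "Kyoto Protocol", "Greenhouse Gas",
                     "Carbon Dioxide", "Petrol Engine"}

score = link_relatedness(automobile_in, global_warming_in, 2_000_000)
print(round(score, 3))
```

The measure needs only set sizes and one intersection per pair, which is why it is cheap compared with approaches that process the full article text.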


Semantic Relatedness | evaluation

Dataset                    WikiRelate    ESA    WLM
Miller & Charles           45%           73%    70%
Rubenstein & Goodenough    52%           82%    64%
WordSimilarity 353         49%           75%    69%

WikiRelate < WLM < ESA
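Figures like these are typically obtained by correlating a measure's scores with human ratings for the same word pairs. The slide does not say which coefficient was used, so the sketch below assumes Spearman rank correlation, implemented without external libraries. The human ratings are the eight word-pair scores shown on the Semantic Relatedness slide; the system scores are invented for illustration.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


human = [6.77, 7.35, 10.00, 7.46, 7.62, 7.58, 5.77, 0.92]
system = [0.60, 0.70, 1.00, 0.55, 0.80, 0.75, 0.40, 0.05]  # invented scores

print(round(spearman(human, system), 2))
```

Rank correlation only asks whether the measure orders pairs the way people do, so it is unaffected by the measure using a 0 to 1 scale while the human ratings run from 0 to 10.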


Wikification

How do we accurately cross-reference documents with Wikipedia?

Wikipedia contains millions of examples of how to do this.
Which terms relate to concepts?
How do we resolve ambiguous terms?
How do we select the concepts that are relevant?


Wikification | identifying concept terms

Wikipedia’s links provide a huge vocabulary of which terms can resolve to which concepts

“Six central banks, including the Bank of England, have cut interest rates by half a percentage point in an effort to steady the faltering global economy.”

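[Figure: candidate concepts for terms in the sentence, including Six (number), Article (grammar), One half and Property; two percentages are shown, 0.002% and 15%]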
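One statistic that Wikipedia's links make available is how often a term is actually used as a link when it appears in an article. The sketch below assumes that this "link probability" is the fraction of articles mentioning a term that also link it, and that terms below a threshold are discarded. The function names, the threshold, and all counts are invented for illustration; the slide does not say how its percentages were computed.

```python
def link_probability(articles_linking, articles_mentioning):
    """Fraction of articles mentioning a term that also use it as a link."""
    if articles_mentioning == 0:
        return 0.0
    return articles_linking / articles_mentioning


def candidate_terms(counts, threshold):
    """Keep the terms whose link probability reaches the threshold."""
    return [
        term
        for term, (linking, mentioning) in counts.items()
        if link_probability(linking, mentioning) >= threshold
    ]


# Invented counts: (articles linking the term, articles mentioning it).
counts = {
    "six": (5, 100_000),
    "bank of england": (900, 1_000),
    "interest rates": (120, 1_000),
    "half": (3, 50_000),
}

print(candidate_terms(counts, threshold=0.05))
```

A low threshold keeps recall high at this stage; deciding which of the surviving candidates are worth linking is a separate step.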


Wikification | resolving ambiguity

For every link in Wikipedia, a human author has manually chosen the correct destination.

“Six central banks, including the Bank of England, have cut interest rates by half a percentage point in an effort to steady the faltering global economy.”

Senses of “bank”:
A movement in flight       0.3%
An underwater hill         0.3%
Edge of river or stream    1.8%
Financial institution      97.0%

“The story begins on the banks of the Rio Negro in the Central Amazon. A party of scientists is embarking on a voyage which they hope will provide answers to a five hundred year old mystery.”

recall 96% | precision 98%
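The sense percentages on this slide show why frequency alone is not enough: always choosing the most common sense gives "Financial institution" for both example sentences, which is wrong for the Rio Negro one. The sketch below combines the prior with a context score to illustrate the idea. It is a simple stand-in, not the authors' method; the function names and the context relatedness values are invented, and the priors are the percentages as read from the slide.

```python
# Prior for each sense of "bank", as read from the slide.
COMMONNESS = {
    "Financial institution": 0.970,
    "Edge of river or stream": 0.018,
    "An underwater hill": 0.003,
    "A movement in flight": 0.003,
}


def most_common_sense(commonness):
    """Baseline: ignore context and pick the most frequent sense."""
    return max(commonness, key=commonness.get)


def disambiguate(commonness, context_relatedness):
    """Pick the sense with the best prior-weighted fit to the context."""
    return max(
        commonness,
        key=lambda sense: commonness[sense] * context_relatedness.get(sense, 0.0),
    )


# Invented relatedness of each sense to the other topics in each sentence.
finance_context = {"Financial institution": 0.80, "Edge of river or stream": 0.05}
river_context = {
    "Financial institution": 0.01,
    "Edge of river or stream": 0.90,
    "An underwater hill": 0.30,
    "A movement in flight": 0.01,
}

print(most_common_sense(COMMONNESS))
print(disambiguate(COMMONNESS, finance_context))
print(disambiguate(COMMONNESS, river_context))
```

Multiplying the two signals is the simplest possible combination; a learned classifier could weigh them differently depending on how informative the context is.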


Wikification | selecting relevant concepts

Wikipedians do not link to every single article, only to ones that readers would want to investigate

“Six central banks, including the Bank of England, have cut interest rates by half a percentage point in an effort to steady the faltering global economy.”

“The story begins on the banks of the Rio Negro in the Central Amazon. A party of scientists is embarking on a voyage which they hope will provide answers to a five hundred year old mystery.”

recall 74% | precision 74%
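Recall and precision figures like these compare the links a system proposes with the links human editors actually made. A minimal sketch of that calculation follows; the function name and both sets of links are invented for illustration and are not the evaluation data behind the slide.

```python
def precision_recall(predicted, gold):
    """Compare predicted links against the links chosen by human editors."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: links an editor made vs. links a system proposed.
gold = {"Central bank", "Bank of England", "Interest rate", "Economy"}
predicted = {"Central bank", "Bank of England", "Interest rate", "Percentage point"}

precision, recall = precision_recall(predicted, gold)
print(f"recall {recall:.0%} | precision {precision:.0%}")
```

Precision penalizes proposing links that editors would not have made, while recall penalizes missing links they did make, so reporting both shows how the system trades one against the other.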


What next?

Explore applications for Wikification:
Topic Indexing
Document Clustering
Document Summarization

Revisit Koru
Apply semantic relatedness and wikification to knowledge base generation, query expansion, and exploratory search

Write up!


References

Milne, D., Medelyan, O. and Witten, I.H. Mining Domain-Specific Thesauri from Wikipedia: A case study. In Proceedings of WI 2006, Hong Kong.

Milne, D., Witten, I.H. and Nichols, D.M. A Knowledge-Based Search Engine Powered by Wikipedia. In Proceedings of CIKM 2007, Lisbon, Portugal.

Milne, D. and Witten, I.H. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of WIKIAI 2008, Chicago, IL.

Milne, D. and Witten, I.H. Learning to link with Wikipedia. To appear in Proceedings of CIKM 2008, Napa Valley, California.

Websites and Demos

www.cs.waikato.ac.nz/~dnk2
www.nzdl.org/koru
wikipedia-miner.sourceforge.net
www.nzdl.org/wikification
