Concept-Based Information Retrieval using Explicit Semantic Analysis


Transcript of Concept-Based Information Retrieval using Explicit Semantic Analysis

Page 1: Concept-Based Information Retrieval using Explicit Semantic Analysis

Concept-Based Information Retrieval using Explicit Semantic Analysis

M.Sc. Seminar talk

Ofer Egozi, CS Department, Technion. Supervisor: Prof. Shaul Markovitch. 24/6/09

Page 2: Concept-Based Information Retrieval using Explicit Semantic Analysis

Information Retrieval

[Diagram: a query is submitted to the IR engine, which returns matching documents; effectiveness is measured by Recall and Precision]
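Both measures have one-line definitions; a minimal sketch (standard formulas; the document sets are illustrative, not from the talk):

```python
# precision = |retrieved ∩ relevant| / |retrieved|
# recall    = |retrieved ∩ relevant| / |relevant|
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

print(precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"}))
# -> (0.666..., 0.5)
```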

Page 3: Concept-Based Information Retrieval using Explicit Semantic Analysis

Ranked retrieval

[Diagram: the IR engine returns a ranked list of results]

Page 4: Concept-Based Information Retrieval using Explicit Semantic Analysis

Keyword-based retrieval

[Diagram: query and documents are matched by the IR engine]

Bag Of Words (BOW)

Page 5: Concept-Based Information Retrieval using Explicit Semantic Analysis

Problem: retrieval misses

[Diagram: the query reaches the IR engine but misses the relevant document below]

Query: "salvaging shipwreck treasure"

“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."

TREC document LA071689-0089

TREC topic #411

Page 6: Concept-Based Information Retrieval using Explicit Semantic Analysis

The vocabulary problem

salvaging shipwreck treasure

“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."

Matching the query to the document can be attempted at three levels:

• Identity: syntax (tokenization, stemming…) – but stems also match shipping / treasurer

• Similarity: synonyms (WordNet etc.) – but synonym lists also bring in deliver / scavenge / relieve

• Relatedness: semantics / world knowledge – by what means?

Synonymy / Polysemy

Page 7: Concept-Based Information Retrieval using Explicit Semantic Analysis

Concept-based retrieval

salvaging shipwreck treasure

“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."

IR

Page 8: Concept-Based Information Retrieval using Explicit Semantic Analysis

Concept-based representations

• Human-edited Thesauri (e.g. WordNet)– Source: editors , concepts: words , mapping: manual

• Corpus-based Thesauri (e.g. co-occurrence)– Source: corpus , concepts: words , mapping: automatic

• Ontology mapping (e.g. KeyConcept)– Source: ontology , concepts: ontology node(s) , mapping: automatic

• Latent analysis (e.g. LSA, pLSA, LDA)– Source: corpus , concepts: word distributions , mapping: automatic

Each approach suffers from at least one drawback: insufficient granularity, non-intuitive concepts, expensive repetitive computations, or a non-scalable solution.

Page 9: Concept-Based Information Retrieval using Explicit Semantic Analysis


Is it possible to devise a concept-based representation that is scalable, computationally feasible, and uses intuitive and granular concepts?

Page 10: Concept-Based Information Retrieval using Explicit Semantic Analysis

Explicit Semantic Analysis

Gabrilovich and Markovitch (2005,2006,2007)

Page 11: Concept-Based Information Retrieval using Explicit Semantic Analysis

Explicit Semantic Analysis (ESA)

Wikipedia is viewed as an ontology – a collection of ~1M concepts, one per article (e.g. Panthera, World War II, Jane Fonda, Island).

Page 12: Concept-Based Information Retrieval using Explicit Semantic Analysis

Every Wikipedia article represents a concept, and the article's words are associated with that concept via TF·IDF weights. For example, the concept Panthera is associated with:

Cat [0.92], Leopard [0.84], Roar [0.77]


Page 14: Concept-Based Information Retrieval using Explicit Semantic Analysis

The semantics of a word is the vector of its associations with Wikipedia concepts:

Cat → Panthera [0.92], Cat [0.95], Jane Fonda [0.07]
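A minimal sketch of this word-level interpretation, using a toy inverted index whose weights echo the slides' examples (the real index spans ~1M Wikipedia concepts; the function name is ours):

```python
# Toy ESA sketch: a word's semantics is its vector of TF·IDF
# associations with Wikipedia concepts (articles). This tiny index
# stands in for the real Wikipedia-wide one.
INVERTED_INDEX = {
    "cat":     {"Panthera": 0.92, "Cat": 0.95, "Jane Fonda": 0.07},
    "leopard": {"Panthera": 0.84},
    "roar":    {"Panthera": 0.77},
}

def esa_vector(word):
    """Return the ESA concept vector of a single word."""
    return INVERTED_INDEX.get(word.lower(), {})

print(esa_vector("Cat"))
# {'Panthera': 0.92, 'Cat': 0.95, 'Jane Fonda': 0.07}
```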

Page 15: Concept-Based Information Retrieval using Explicit Semantic Analysis

Explicit Semantic Analysis (ESA)

The semantics of a text fragment is the average vector (centroid) of the semantics of its words:

button → Dick Button [0.84], Button [0.93], Game Controller [0.32], Mouse (computing) [0.81]

mouse → Mouse (computing) [0.84], Mouse (rodent) [0.91], John Steinbeck [0.17], Mickey Mouse [0.81]

mouse button → Drag-and-drop [0.91], Mouse (computing) [0.95], Mouse (rodent) [0.56], Game Controller [0.64]

In practice, the centroid performs disambiguation: combining "mouse" and "button" strengthens the shared computing concepts.
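A matching sketch of the centroid step, with the slide's per-word weights hard-coded (helper names are ours, not the system's):

```python
# Sketch: the ESA vector of a text fragment is the centroid of its
# words' concept vectors. Weights follow the slide's example.
from collections import defaultdict

WORD_VECTORS = {
    "button": {"Dick Button": 0.84, "Button": 0.93,
               "Game Controller": 0.32, "Mouse (computing)": 0.81},
    "mouse":  {"Mouse (computing)": 0.84, "Mouse (rodent)": 0.91,
               "John Steinbeck": 0.17, "Mickey Mouse": 0.81},
}

def esa_centroid(text):
    """Average the concept vectors of the fragment's words."""
    words = text.lower().split()
    centroid = defaultdict(float)
    for w in words:
        for concept, weight in WORD_VECTORS.get(w, {}).items():
            centroid[concept] += weight / len(words)
    return dict(centroid)

top = sorted(esa_centroid("mouse button").items(), key=lambda kv: -kv[1])
print(top[:3])  # the shared computing concept rises to the top
```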

Page 16: Concept-Based Information Retrieval using Explicit Semantic Analysis

MORAG*: An ESA-based information retrieval algorithm

*MORAG: Flail in Hebrew

“Concept-based feature generation and selection for information retrieval”, AAAI-2008

Page 17: Concept-Based Information Retrieval using Explicit Semantic Analysis

Enrich documents/queries

[Diagram: ESA concepts are added to both documents and the query before matching]

Constraint: use only the strongest concepts
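A small sketch of this constraint, assuming some interpreter function mapping text to an ESA concept vector; the cutoff k and the stub interpreter below are illustrative, not values from the talk:

```python
# Sketch of the enrichment constraint: when adding ESA concepts to a
# document or query, keep only the k strongest ones as index features.
def strongest_concepts(concept_vector, k):
    """Truncate an ESA vector to its k highest-weighted concepts."""
    return dict(sorted(concept_vector.items(), key=lambda kv: -kv[1])[:k])

def enrich(text, interpreter, k=10):
    """Index features: the text's own words plus its top-k concepts."""
    return {"words": text.lower().split(),
            "concepts": strongest_concepts(interpreter(text), k)}

# Hypothetical usage with a stubbed interpreter:
stub = lambda t: {"Shipwreck": 0.9, "Scuba diving": 0.6, "Treasure": 0.5}
print(enrich("divers recovered artifacts", stub, k=2))
```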

Page 18: Concept-Based Information Retrieval using Explicit Semantic Analysis

Problem: documents (in)coherence

REFERENCE BOOKS SPEAK VOLUMES TO KIDS;

With the school year in high gear, it's a good time to consider new additions to children's home reference libraries…

…Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16…

…"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books…

…"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea…

TREC document LA120790-0036

The document is judged relevant for topic #411 due to a single relevant passage.

The concepts generated for this document will average out to the books / children concepts, losing the shipwreck mentions…

This is not an issue in BOW retrieval, where words are indexed independently. How should concept-based retrieval deal with it?

Page 19: Concept-Based Information Retrieval using Explicit Semantic Analysis

Solution: split to passages

[Diagram: documents are split into passages; both the full document and its passages are ESA-enriched and indexed]

Index both the full document and its passages. Best performance is achieved with fixed-length overlapping sliding windows.

ConceptScore(d) = ConceptScore(full-doc) + max_{passage ∈ d} ConceptScore(passage)
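A sketch of this scoring scheme, assuming a dot product as the concept-vector similarity and illustrative window/stride sizes (neither is specified on the slide):

```python
# Passage-based score, per the slide:
#   ConceptScore(d) = ConceptScore(full-doc) + max over passages in d.
def sliding_windows(words, size=50, stride=25):
    """Fixed-length overlapping passages over the document's words."""
    for start in range(0, max(1, len(words) - size + 1), stride):
        yield words[start:start + size]

def dot(u, v):
    return sum(w * v.get(c, 0.0) for c, w in u.items())

def concept_score(query_vec, doc_words, interpreter, size=50, stride=25):
    """interpreter: any text -> ESA concept-vector function."""
    full = dot(query_vec, interpreter(" ".join(doc_words)))
    best = max(dot(query_vec, interpreter(" ".join(p)))
               for p in sliding_windows(doc_words, size, stride))
    return full + best
```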

Page 20: Concept-Based Information Retrieval using Explicit Semantic Analysis

MORAG ranking


Score(q,d) = α·ConceptScore(q,d) + (1−α)·KeywordScore(q,d)
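In code the interpolation is a single line; alpha stands in for the mixing weight whose symbol was garbled on the slide, assumed to be a tuned constant in [0, 1]:

```python
# Final MORAG ranking: linear interpolation of the concept-based and
# keyword-based scores. alpha = 0.5 is an illustrative default.
def morag_score(concept_score, keyword_score, alpha=0.5):
    return alpha * concept_score + (1 - alpha) * keyword_score
```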

Page 21: Concept-Based Information Retrieval using Explicit Semantic Analysis

ESA-based retrieval example

salvaging shipwreck treasure

“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."

Top concepts generated for the query:

• SHIPWRECK
• TREASURE
• MARITIME ARCHAEOLOGY
• MARINE SALVAGE
• HISTORY OF THE BRITISH VIRGIN ISLANDS
• WRECKING (SHIPWRECK)
• KEY WEST, FLORIDA
• FLOTSAM AND JETSAM
• WRECK DIVING
• SPANISH TREASURE FLEET

Top concepts generated for the document:

• SCUBA DIVING
• WRECK DIVING
• RMS TITANIC
• USS HOEL (DD-533)
• SHIPWRECK
• UNDERWATER ARCHAEOLOGY
• USS MAINE (ACR-1)
• MARITIME ARCHAEOLOGY
• TOMB RAIDER II
• USS MEADE (DD-602)

Page 22: Concept-Based Information Retrieval using Explicit Semantic Analysis

Problem: irrelevant docs retrieved

Query (TREC topic #434): "Estonia economy"

Top concepts generated for the query:

• ESTONIA
• ECONOMY OF ESTONIA
• ESTONIA AT THE 2000 SUMMER OLYMPICS
• ESTONIA AT THE 2004 SUMMER OLYMPICS
• ESTONIA NATIONAL FOOTBALL TEAM
• ESTONIA AT THE 2006 WINTER OLYMPICS
• BALTIC SEA
• EUROZONE
• TIIT VÄHI
• MILITARY OF ESTONIA

Retrieved document: “Olympic News In Brief: Cycling win for Estonia. Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete.”

Top concepts generated for the document:

• ESTONIA AT THE 2000 SUMMER OLYMPICS
• ESTONIA AT THE 2004 SUMMER OLYMPICS
• 2006 COMMONWEALTH GAMES
• ESTONIA AT THE 2006 WINTER OLYMPICS
• 1992 SUMMER OLYMPICS
• ATHLETICS AT THE 2004 SUMMER OLYMPICS
• 2000 SUMMER OLYMPICS
• 2006 WINTER OLYMPICS
• CROSS-COUNTRY SKIING 2006 WINTER OLYMPICS
• NEW ZEALAND AT THE 2006 WINTER OLYMPICS

The query's broad Olympics concepts overlap the document's concepts, so this irrelevant document is retrieved.

Page 24: Concept-Based Information Retrieval using Explicit Semantic Analysis

“Economy” is not mentioned, but the TF·IDF weight of “Estonia” is strong enough to trigger this concept on its own…

Page 25: Concept-Based Information Retrieval using Explicit Semantic Analysis

Problem: selecting query features

• Selection could remove noisy ESA concepts

• However, the IR task provides no training data…

A utility function U(+|−) requires a target measure, i.e., a training set.

[Diagram: f = ESA(q) → Filter (guided by U) → f′]

We focus on selecting query concepts: the query is short and noisy, while feature selection at indexing time lacks the query's context.

Page 26: Concept-Based Information Retrieval using Explicit Semantic Analysis

Solution: Pseudo-Relevance Feedback

Use the BOW results as positive / negative examples.
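A minimal sketch of this labeling step; the cutoffs are illustrative, not the values used in the evaluation:

```python
# Pseudo-relevance feedback: run ordinary BOW retrieval first, then
# treat the top of its ranking as positive examples and the bottom
# as negatives.
def pseudo_labels(bow_ranking, n_pos=20, n_neg=100):
    """bow_ranking: document ids ordered best-first by BOW score."""
    positives = bow_ranking[:n_pos]
    negatives = bow_ranking[-n_neg:]
    return positives, negatives
```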

Page 27: Concept-Based Information Retrieval using Explicit Semantic Analysis

ESA feature selection methods

1. IG (filter) – calculate each feature’s Information Gain in separating the positive and negative examples, and keep the best-performing features (see the sketch after this list)

2. RV (filter) – add the concepts appearing in the positive examples to the candidate features, and re-weight all features based on their weights in the examples

3. IIG (wrapper) – find the subset of features that best separates the positive and negative examples, using heuristic search
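A minimal sketch of the IG filter (method 1), treating each concept as a binary present/absent feature over the pseudo-labeled examples; the exact formulation in MORAG may differ:

```python
# Information gain of a binary feature ("concept appears in the
# example's ESA vector") for separating positives from negatives.
import math

def entropy(pos, neg):
    """Binary class entropy of a (pos, neg) count pair."""
    total, h = pos + neg, 0.0
    for c in (pos, neg):
        if 0 < c < total:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(concept, pos_docs, neg_docs):
    """pos_docs / neg_docs: lists of concept sets, one per example."""
    pos_has = sum(concept in d for d in pos_docs)
    neg_has = sum(concept in d for d in neg_docs)
    pos, neg = len(pos_docs), len(neg_docs)
    total = pos + neg
    has, lacks = pos_has + neg_has, total - pos_has - neg_has
    cond = 0.0
    if has:
        cond += has / total * entropy(pos_has, neg_has)
    if lacks:
        cond += lacks / total * entropy(pos - pos_has, neg - neg_has)
    return entropy(pos, neg) - cond  # H(class) - H(class | feature)
```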

Page 28: Concept-Based Information Retrieval using Explicit Semantic Analysis

ESA-based retrieval – feature selection example

Query: "Estonia economy"

Document: “Olympic News In Brief: Cycling win for Estonia. Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete.”

Query concepts (note the broad features):

• ESTONIA
• ECONOMY OF ESTONIA
• ESTONIA AT THE 2000 SUMMER OLYMPICS
• ESTONIA AT THE 2004 SUMMER OLYMPICS
• ESTONIA NATIONAL FOOTBALL TEAM
• ESTONIA AT THE 2006 WINTER OLYMPICS
• BALTIC SEA
• EUROZONE
• TIIT VÄHI
• MILITARY OF ESTONIA

Document concepts:

• ESTONIA AT THE 2000 SUMMER OLYMPICS
• ESTONIA AT THE 2004 SUMMER OLYMPICS
• 2006 COMMONWEALTH GAMES
• ESTONIA AT THE 2006 WINTER OLYMPICS
• 1992 SUMMER OLYMPICS
• ATHLETICS AT THE 2004 SUMMER OLYMPICS
• 2000 SUMMER OLYMPICS
• 2006 WINTER OLYMPICS
• CROSS-COUNTRY SKIING 2006 WINTER OLYMPICS
• NEW ZEALAND AT THE 2006 WINTER OLYMPICS

RV adds features:

• MONETARY POLICY
• EURO
• ECONOMY OF EUROPE
• NORDIC COUNTRIES
• PRIME MINISTER OF ESTONIA
• NEOLIBERALISM (a noise feature)

After re-weighting, the useful features “bubble up”.

Page 29: Concept-Based Information Retrieval using Explicit Semantic Analysis

MORAG evaluation

• Tested over the TREC-8 and Robust-04 datasets (528K documents, 50 web-like queries)

• Feature selection is highly effective

Page 30: Concept-Based Information Retrieval using Explicit Semantic Analysis

MORAG evaluation

• Significant performance improvement, over our own baseline and also over the top-performing TREC-8 BOW baselines

Concept-based performance by itself is quite low. A major reason is TREC’s “pooling” method, which implies that relevant documents found only by MORAG were never judged, and so are counted as non-relevant…

Page 31: Concept-Based Information Retrieval using Explicit Semantic Analysis

MORAG evaluation

• Optimal (“Oracle”) selection analysis shows much more potential for MORAG

Page 32: Concept-Based Information Retrieval using Explicit Semantic Analysis

MORAG evaluation

• Pseudo-relevance proves to be a good approximation of actual relevance

Page 33: Concept-Based Information Retrieval using Explicit Semantic Analysis

Conclusion

• MORAG: a new methodology for concept-based information retrieval

• Documents and queries are enriched with Wikipedia concepts

• Informative features are selected using pseudo-relevance feedback

• The generated features improve the performance of BOW-based systems

Page 34: Concept-Based Information Retrieval using Explicit Semantic Analysis

Thank you!