Concept-Based Information Retrieval using Explicit Semantic Analysis
![Page 1: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/1.jpg)
Concept-Based Information Retrieval using Explicit Semantic Analysis
M.Sc. Seminar talk
Ofer Egozi, CS Department, Technion
Supervisor: Prof. Shaul Markovitch
24/6/09
![Page 2: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/2.jpg)
Information Retrieval
Query → IR
Recall
Precision
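The two measures named on this slide can be made concrete with a small sketch. The function name and the document ids below are hypothetical and serve only as illustration; precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.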
![Page 3: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/3.jpg)
Ranked retrieval
Query → IR
![Page 4: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/4.jpg)
Keyword-based retrieval
Query → IR
Bag Of Words (BOW)
![Page 5: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/5.jpg)
Problem: retrieval misses
Query → IR
salvaging shipwreck treasure
“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
TREC document LA071689-0089
TREC topic #411
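A minimal sketch of why a plain bag-of-words match misses this document. It assumes a naive tokenizer (lowercase, letters only) with no stemming or synonym expansion; the helper name `bag_of_words` is made up for the example.

```python
import re


def bag_of_words(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


query = "salvaging shipwreck treasure"
document = (
    "ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater "
    "for more than 2,000 years in the wreck of a Roman ship that sank in the "
    "Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
)

# Terms the query and the document have in common under exact matching.
shared_terms = bag_of_words(query) & bag_of_words(document)
```

The intersection is empty: the document says "wreck", "ship" and "recovered artifacts", but none of the three query terms appears verbatim, so an exact-match keyword ranker has nothing to score.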
![Page 6: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/6.jpg)
The vocabulary problem
salvaging shipwreck treasure
“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
• Identity: Syntax (tokenization, stemming…) [but also shipping/treasurer]
• Similarity: Synonyms (WordNet etc.) [but also deliver/scavenge/relieve]
• Relatedness: Semantics / world knowledge (???)
Synonymy / Polysemy
![Page 7: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/7.jpg)
Concept-based retrieval
salvaging shipwreck treasure
“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
IR
![Page 8: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/8.jpg)
Concept-based representations
• Human-edited Thesauri (e.g. WordNet)
  – Source: editors, concepts: words, mapping: manual
• Corpus-based Thesauri (e.g. co-occurrence)
  – Source: corpus, concepts: words, mapping: automatic
• Ontology mapping (e.g. KeyConcept)
  – Source: ontology, concepts: ontology node(s), mapping: automatic
• Latent analysis (e.g. LSA, pLSA, LDA)
  – Source: corpus, concepts: word distributions, mapping: automatic
Insufficient granularity
Non-intuitive Concepts
Expensive repetitive computations
Non-scalable solution
![Page 9: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/9.jpg)
Is it possible to devise a concept-based representation that is scalable, computationally feasible, and uses intuitive and granular concepts?
![Page 10: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/10.jpg)
Explicit Semantic Analysis
Gabrilovich and Markovitch (2005, 2006, 2007)
![Page 11: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/11.jpg)
Explicit Semantic Analysis (ESA)
Wikipedia is viewed as an ontology - a collection of ~1M concepts
Example concepts: Panthera, World War II, Jane Fonda, Island
![Page 12: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/12.jpg)
Explicit Semantic Analysis (ESA)
Wikipedia is viewed as an ontology - a collection of ~1M concepts
Every Wikipedia article represents a concept
Article words are associated with the concept (TF·IDF)
Concept: Panthera
Cat [0.92]
Leopard [0.84]
Roar [0.77]
![Page 14: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/14.jpg)
The semantics of a word is the vector of its associations with Wikipedia concepts
Cat → Panthera [0.92], Cat [0.95], Jane Fonda [0.07]
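The word-to-concept mapping can be sketched as an inverted index built from article text. Everything below is an assumption for illustration: the three "articles" are toy stand-ins rather than real Wikipedia content, the names `tokenize` and `build_word_vectors` are made up, and the smoothed IDF variant is chosen only so that words shared by all toy articles keep a non-zero weight. It is not presented as the weighting used in the original ESA work.

```python
import math
import re
from collections import Counter

# Toy stand-ins for Wikipedia articles (hypothetical text).
articles = {
    "Panthera": "panthera is a genus of big cat species such as the leopard and the lion that roar",
    "Cat": "the cat is a small domesticated cat kept as a pet",
    "Jane Fonda": "jane fonda is an american actress who appeared in the film cat ballou",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_word_vectors(articles):
    """Map each word to {concept: TF-IDF weight} (an inverted index over concepts)."""
    term_counts = {concept: Counter(tokenize(text)) for concept, text in articles.items()}
    n_articles = len(articles)
    # Number of articles in which each word occurs at least once.
    doc_freq = Counter(word for counts in term_counts.values() for word in counts)
    vectors = {}
    for concept, counts in term_counts.items():
        for word, tf in counts.items():
            idf = math.log(1 + n_articles / doc_freq[word])  # smoothed IDF (assumption)
            vectors.setdefault(word, {})[concept] = tf * idf
    return vectors


word_vectors = build_word_vectors(articles)
cat_vector = word_vectors["cat"]  # the "semantics" of the word "cat"
```

In this toy corpus "cat" is associated with all three concepts, most strongly with the Cat article, while "leopard" maps only to Panthera.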
![Page 15: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/15.jpg)
Explicit Semantic Analysis (ESA)
The semantics of a text fragment is the average vector (centroid) of the semantics of its words
button → Dick Button [0.84], Button [0.93], Game Controller [0.32], Mouse (computing) [0.81]
mouse → Mouse (computing) [0.84], Mouse (rodent) [0.91], John Steinbeck [0.17], Mickey Mouse [0.81]
mouse button → Drag-and-drop [0.91], Mouse (computing) [0.95], Mouse (rodent) [0.56], Game Controller [0.64]
In practice – disambiguation…
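A minimal sketch of the centroid step, using the per-word weights shown on the slide for "button" and "mouse". The function name `text_vector` is an assumption, and concepts missing from a word's vector are treated as weight 0.

```python
from collections import defaultdict

# Word-to-concept weights as listed on the slide.
word_vectors = {
    "button": {
        "Dick Button": 0.84,
        "Button": 0.93,
        "Game Controller": 0.32,
        "Mouse (computing)": 0.81,
    },
    "mouse": {
        "Mouse (computing)": 0.84,
        "Mouse (rodent)": 0.91,
        "John Steinbeck": 0.17,
        "Mickey Mouse": 0.81,
    },
}


def text_vector(words, word_vectors):
    """Average the concept vectors of the known words (absent concepts count as 0)."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    centroid = defaultdict(float)
    for vec in known:
        for concept, weight in vec.items():
            centroid[concept] += weight / len(known)
    return dict(centroid)


fragment = text_vector(["mouse", "button"], word_vectors)
top_concept = max(fragment, key=fragment.get)
```

Averaging pushes "Mouse (computing)" to the top because it is the only concept both words share, which is the disambiguation effect the slide points at. The slide's own "mouse button" weights appear to be illustrative and are not the exact averages computed here.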
![Page 16: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/16.jpg)
MORAG*: An ESA-based information retrieval algorithm
*MORAG: Flail in Hebrew
“Concept-based feature generation and selection for information retrieval”, AAAI-2008
![Page 17: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/17.jpg)
Enrich documents/queries
ESA
Query → IR
Constraint: use only the strongest concepts
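The constraint can be sketched as a simple top-k truncation of a concept vector. The cutoff `k`, the function name and the weights below are hypothetical; the slide does not say how many concepts are kept.

```python
import heapq


def strongest_concepts(concept_vector, k):
    """Keep only the k highest-weighted concepts of an ESA vector."""
    top = heapq.nlargest(k, concept_vector.items(), key=lambda item: item[1])
    return dict(top)


# Hypothetical weights, for illustration only.
vector = {
    "Shipwreck": 0.9,
    "Treasure": 0.8,
    "Marine salvage": 0.6,
    "Tomb Raider II": 0.1,
}
pruned = strongest_concepts(vector, k=2)
```

Truncation keeps the index small and drops weakly associated concepts, at the cost of possibly discarding a useful one that ranks just below the cutoff.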
![Page 18: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/18.jpg)
Problem: documents (in)coherence
REFERENCE BOOKS SPEAK VOLUMES TO KIDS;
With the school year in high gear, it's a good time to consider new additions to children's home reference libraries…
…Also new from Pharos-World Almanac: "The World Almanac InfoPedia," a single-volume visual encyclopedia designed for ages 8 to 16…
…"The Doubleday Children's Encyclopedia," designed for youngsters 7 to 11, bridges the gap between single-subject picture books and formal reference books…
…"The Lost Wreck of the Isis" by Robert Ballard is the latest adventure in the Time Quest Series from Scholastic-Madison Press ($15.95 hardcover). Designed for children 8 to 12, it tells the story of Ballard's 1988 discovery of an ancient Roman shipwreck deep in the Mediterranean Sea…
TREC document LA120790-0036
Document is judged relevant for topic 411 due to one relevant passage in it
Concepts generated for this document will average to the books / children concepts, and lose the shipwreck mentions…
Not an issue in BOW retrieval where words are indexed independently. How to deal with in concept-based?
![Page 19: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/19.jpg)
Solution: split to passages
ESA
Query → IR
Index both full document and passages.
Best performance achieved by fixed-length overlapping sliding windows.
$$\mathrm{ConceptScore}(d) = \mathrm{ConceptScore}(\text{full-doc}) + \max_{\text{passage} \in d} \mathrm{ConceptScore}(\text{passage})$$
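A sketch of the passage step under stated assumptions: windows are cut over a token list with a fixed `size` and `step` (with `step` no larger than `size`, so windows overlap), and the concept similarity is taken to be a sparse dot product. The slide specifies neither the window parameters nor the similarity function, so the helper names and all numbers here are hypothetical.

```python
def sliding_windows(tokens, size, step):
    """Fixed-length overlapping windows; the last window may be shorter."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + step, step)]


def concept_score(query_vec, text_vec):
    """Dot product of two sparse concept vectors (an assumed similarity)."""
    return sum(w * text_vec.get(c, 0.0) for c, w in query_vec.items())


def document_concept_score(query_vec, full_doc_vec, passage_vecs):
    """Full-document score plus the score of the best-matching passage."""
    best_passage = max((concept_score(query_vec, p) for p in passage_vecs), default=0.0)
    return concept_score(query_vec, full_doc_vec) + best_passage


windows = sliding_windows(list(range(10)), size=4, step=2)

# Hypothetical concept vectors for a mostly off-topic document with one relevant passage.
query_vec = {"Shipwreck": 1.0}
full_doc_vec = {"Children's literature": 0.8, "Shipwreck": 0.1}
passage_vecs = [{"Encyclopedia": 0.9}, {"Shipwreck": 0.7}]
score = document_concept_score(query_vec, full_doc_vec, passage_vecs)
```

The full-document vector barely registers the query concept, but the single relevant passage lifts the combined score, which is the behaviour the formula is meant to capture.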
![Page 20: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/20.jpg)
MORAG ranking
Query → IR
$$\mathrm{Score}(q,d) = \lambda \cdot \mathrm{ConceptScore}(q,d) + (1-\lambda) \cdot \mathrm{KeywordScore}(q,d)$$
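The ranking formula is a linear interpolation between the concept-based and keyword-based scores. A minimal sketch, assuming both scores are already on a comparable scale and that the mixing weight lies in [0, 1]; the parameter name `weight` and the sample values are hypothetical.

```python
def morag_score(concept_score, keyword_score, weight):
    """Interpolate between concept and keyword scores with a weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * concept_score + (1.0 - weight) * keyword_score


# Hypothetical scores for one (query, document) pair.
combined = morag_score(concept_score=0.8, keyword_score=0.2, weight=0.5)
```

A weight of 0 falls back to pure keyword retrieval and a weight of 1 to pure concept retrieval, so the weight controls how much the concepts are allowed to influence the baseline ranking.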
![Page 21: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/21.jpg)
ESA-based retrieval example
salvaging shipwreck treasure
“ANCIENT ARTIFACTS FOUND. Divers have recovered artifacts lying underwater for more than 2,000 years in the wreck of a Roman ship that sank in the Gulf of Baratti, 12 miles off the island of Elba, newspapers reported Saturday."
Query concepts: • SHIPWRECK • TREASURE • MARITIME ARCHAEOLOGY • MARINE SALVAGE • HISTORY OF THE BRITISH VIRGIN ISLANDS • WRECKING (SHIPWRECK) • KEY WEST, FLORIDA • FLOTSAM AND JETSAM • WRECK DIVING • SPANISH TREASURE FLEET
Document concepts: • SCUBA DIVING • WRECK DIVING • RMS TITANIC • USS HOEL (DD-533) • SHIPWRECK • UNDERWATER ARCHAEOLOGY • USS MAINE (ACR-1) • MARITIME ARCHAEOLOGY • TOMB RAIDER II • USS MEADE (DD-602)
![Page 22: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/22.jpg)
Problem: irrelevant docs retrieved
Query: Estonia economy (TREC topic #434)
Query concepts: • ESTONIA • ECONOMY OF ESTONIA • ESTONIA AT THE 2000 SUMMER OLYMPICS • ESTONIA AT THE 2004 SUMMER OLYMPICS • ESTONIA NATIONAL FOOTBALL TEAM • ESTONIA AT THE 2006 WINTER OLYMPICS • BALTIC SEA • EUROZONE • TIIT VÄHI • MILITARY OF ESTONIA
Document: “Olympic News In Brief: Cycling win for Estonia. Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete.”
Document concepts: • ESTONIA AT THE 2000 SUMMER OLYMPICS • ESTONIA AT THE 2004 SUMMER OLYMPICS • 2006 COMMONWEALTH GAMES • ESTONIA AT THE 2006 WINTER OLYMPICS • 1992 SUMMER OLYMPICS • ATHLETICS AT THE 2004 SUMMER OLYMPICS • 2000 SUMMER OLYMPICS • 2006 WINTER OLYMPICS • CROSS-COUNTRY SKIING 2006 WINTER OLYMPICS • NEW ZEALAND AT THE 2006 WINTER OLYMPICS
![Page 23: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/23.jpg)
![Page 24: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/24.jpg)
“Economy” is not mentioned, but the TF·IDF of “Estonia” is strong enough to trigger this concept on its own…
![Page 25: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/25.jpg)
Problem: selecting query features
• Selection could remove noisy ESA concepts
• However, the IR task provides no training data…
• A utility function U(+|-) requires a target measure, and hence a training set
Pipeline: f = ESA(q) → Filter (using U) → f’
Focus on query concepts: the query is short and noisy, while feature selection (FS) at indexing time lacks context
![Page 26: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/26.jpg)
Solution: Pseudo Relevance Feedback
Use BOW results as positive / negative examples
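One way to read this step is sketched below. It assumes the keyword (BOW) ranking is already available as an ordered list of document ids, and that the top of the list supplies the positive examples while the bottom supplies the negative ones. The function name, the cutoffs and the ids are hypothetical; the slide does not state how the examples are chosen.

```python
def pseudo_relevance_examples(ranked_doc_ids, n_positive, n_negative):
    """Split a keyword ranking into pseudo-positive (top) and pseudo-negative (bottom) examples."""
    if n_positive + n_negative > len(ranked_doc_ids):
        raise ValueError("not enough ranked documents for the requested split")
    positives = ranked_doc_ids[:n_positive]
    negatives = ranked_doc_ids[len(ranked_doc_ids) - n_negative:]
    return positives, negatives


# Hypothetical BOW ranking, best match first.
ranking = ["d7", "d2", "d9", "d4", "d1", "d5"]
positives, negatives = pseudo_relevance_examples(ranking, n_positive=2, n_negative=2)
```

No human judgments are involved: the labels are only as good as the keyword ranking that produced them.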
![Page 27: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/27.jpg)
ESA feature selection methods
1. IG (filter) – calculate each feature’s Information Gain in separating positive and negative examples, and keep the best-performing features
2. RV (filter) – add concepts found in the positive examples to the candidate features, and re-weight all features based on their weights in the examples
3. IIG (wrapper) – find the subset of features that best separates positive and negative examples, employing heuristic search
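The IG filter can be illustrated with a small sketch. It assumes each example document is reduced to the set of concepts it triggers and treats "concept present" as a binary feature; the function names and the example sets are hypothetical, and RV and IIG are not covered here.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(concept, positives, negatives):
    """IG of the binary feature 'concept appears in the example's concept set'."""
    pos_with = sum(concept in example for example in positives)
    neg_with = sum(concept in example for example in negatives)
    pos_without = len(positives) - pos_with
    neg_without = len(negatives) - neg_with
    total = len(positives) + len(negatives)
    before = entropy(len(positives), len(negatives))
    after = (
        (pos_with + neg_with) / total * entropy(pos_with, neg_with)
        + (pos_without + neg_without) / total * entropy(pos_without, neg_without)
    )
    return before - after


# Hypothetical concept sets for pseudo-relevant and pseudo-irrelevant documents.
positives = [{"Economy of Estonia", "Estonia"}, {"Economy of Estonia", "Eurozone"}]
negatives = [{"Estonia", "Estonia at the 2000 Summer Olympics"}, {"Estonia at the 2000 Summer Olympics"}]

ig_economy = information_gain("Economy of Estonia", positives, negatives)
ig_estonia = information_gain("Estonia", positives, negatives)
```

In this toy data "Economy of Estonia" separates the two groups perfectly, while "Estonia" occurs equally in both and carries no information. Note that IG by itself is symmetric: the Olympics concept would score just as high because it marks the negative group, so a selection step built on this sketch would also need to check which side a feature favours.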
![Page 28: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/28.jpg)
ESA-based retrieval – FS example
Query: Estonia economy
Query concepts: • ESTONIA • ECONOMY OF ESTONIA • ESTONIA AT THE 2000 SUMMER OLYMPICS • ESTONIA AT THE 2004 SUMMER OLYMPICS • ESTONIA NATIONAL FOOTBALL TEAM • ESTONIA AT THE 2006 WINTER OLYMPICS • BALTIC SEA • EUROZONE • TIIT VÄHI • MILITARY OF ESTONIA
Callouts on the query concepts: broad features, noise features
RV adds features: • MONETARY POLICY • EURO • ECONOMY OF EUROPE • NORDIC COUNTRIES • PRIME MINISTER OF ESTONIA • NEOLIBERALISM
Useful ones “bubble up”
Document: “Olympic News In Brief: Cycling win for Estonia. Erika Salumae won Estonia's first Olympic gold when retaining the women's cycling individual sprint title she won four years ago in Seoul as a Soviet athlete.”
Document concepts: • ESTONIA AT THE 2000 SUMMER OLYMPICS • ESTONIA AT THE 2004 SUMMER OLYMPICS • 2006 COMMONWEALTH GAMES • ESTONIA AT THE 2006 WINTER OLYMPICS • 1992 SUMMER OLYMPICS • ATHLETICS AT THE 2004 SUMMER OLYMPICS • 2000 SUMMER OLYMPICS • 2006 WINTER OLYMPICS • CROSS-COUNTRY SKIING 2006 WINTER OLYMPICS • NEW ZEALAND AT THE 2006 WINTER OLYMPICS
![Page 29: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/29.jpg)
MORAG evaluation
• Testing over TREC-8 and Robust-04 datasets (528K documents, 50 web-like queries)
• Feature selection is highly effective
![Page 30: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/30.jpg)
MORAG evaluation
• Significant performance improvement, over our own baseline and also over top performing TREC-8 BOW baselines
Concept-based performance by itself is quite low. A major reason is the TREC ‘pooling’ method, under which relevant documents found only by MORAG will not be judged as such…
![Page 31: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/31.jpg)
MORAG evaluation
• Optimal (“Oracle”) selection analysis shows much more potential for MORAG
![Page 32: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/32.jpg)
MORAG evaluation
• Pseudo-relevance proves to be a good approximation of actual relevance
![Page 33: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/33.jpg)
Conclusion
• MORAG: a new methodology for concept-based information retrieval
• Documents and query are enhanced by Wikipedia concepts
• Informative features are selected using pseudo-relevance feedback
• The generated features improve the performance of BOW-based systems
![Page 34: Concept-Based Information Retrieval using Explicit Semantic Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022205/58cfdd891a28ab13238b5dd3/html5/thumbnails/34.jpg)
Thank you!