Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Transcript of Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Page 1: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Exploiting Wikipedia for Information Retrieval Tasks

SIGIR Tutorial, August 2015

Department of Information Systems Engineering

Page 2: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Who We Are

Department of Information Systems Engineering, Ben-Gurion University of the Negev


Page 3: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Sessions Agenda

• Introduction to Wikipedia

• Session 1 – Search-related tasks

• Session 2 – Sentiment analysis

• Session 3 – Recommender systems

• Summary


Page 4: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Introduction

Getting Started!


Page 5: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Wikimedia

• “The Wikimedia Foundation, Inc. is a nonprofit charitable organization dedicated to encouraging the growth, development and distribution of free, multilingual, educational content, and to providing the full content of these wiki-based projects to the public free of charge”

• https://wikimediafoundation.org/wiki/Home


Page 6: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

What is Wikipedia?

• Wikipedia is an encyclopedia

• no original research

• neutral point of view

• statements must be verifiable

• must reference reliable published sources

• Wikipedia relies on crowd-sourcing

• anyone can edit

• Wikipedia is big

printwikipedia.com

Page 7: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Structured and Unstructured

• Entity pages
• Categories
• Links
• Disambiguation pages
• Redirect pages
• Navigation boxes
• Infoboxes
• Discussion pages
• User pages
• Page views


Page 8: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Article Count & Growth Rate

• ~4.9 million articles on the English Wikipedia

• Since 2006 around 30,000 new articles per month

• 287 languages


Page 9: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Quality of Wikipedia Article

1. Is the length and structure of article an indication of the importance of this subject?

2. Click on edit history activity: when was the last edit?

3. Talk page checked for debates: what are the issues in this article?

4. Check the references (or lack thereof) – Are all claims referenced (especially if controversial)?

– What is the quality of the sources?

– How relevant are the sources?

https://upload.wikimedia.org/wikipedia/commons/9/96/Evaluating_Wikipedia_brochure_%28Wiki_Education_Foundation%29.pdf

Page 10: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Article Rating System

https://en.wikipedia.org/wiki/Wikipedia:Featured_articles

Page 11: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Internet Encyclopaedias Go Head to Head. Nature 438 [Giles, J., 2005]

Page 12: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Quality indicators:

• Links
• Categories
• Amount of text
• Images
• Various languages
• Page views

Wisdom of the Crowds: Decentralized Knowledge Construction in Wikipedia [Arazy et al., 2007]

Page 13: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Superior Information Source

• Size & scope (for comparison, Britannica has ~40,000 articles)

• Timely & updated

• Wisdom of the crowd


Page 14: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Lens for the Real World

Wikipedia: representative of the real world and of people's understanding

Ideas!

Thoughts

Perceptions


Page 15: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Unique Visitors and Page Views

• http://reportcard.wmflabs.org/

430 million unique visitors in May 2015 (mobile users are not included!)


Page 16: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Editing Wikipedia

[Hill BM, Shaw A, 2013] The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation. PLoS ONE 8(6): e65782.

Page 17: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Literature Review of Scholarly Articles

• http://wikilit.referata.com/wiki/Main_Page


Page 18: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Research Involving Wikipedia

• Researching Wikipedia itself

• Using Wikipedia content as a knowledge resource


Page 19: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Systems that Use Information from Wikipedia

The task – the goal of the system

• Query operations
• Recommendation
• Sentiment analysis
• Ontology building
• …

The challenge for which Wikipedia is a remedy

• Sparseness
• Ambiguity
• Cost of manual labor
• Lack of information
• Understanding/perception
• …

Utilized information

• Concepts/pages, links, categories, redirect pages, views, edits, …

Algorithms & techniques

• How items are matched with Wikipedia pages
• How data is extracted
• How Wikipedia data is utilized
• How the similarity between concepts is computed
• …

Page 20: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

IR & Wikipedia

• Wikipedia as a collection is:

– Enormous
– Timely
– A reflection of crowd wisdom: the connections between entities in Wikipedia represent the way a large number of people view them (computers cannot understand “concepts” and cannot relate things the way humans do)
– Accessible (free!)
– Accurate
– Broad in coverage

• Weaknesses:

– Incomplete
– Imbalanced
– No complete citations

Page 21: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

IR & Wikipedia

• Wikipedia is used for

– Enhancing the performance of IR systems (mainly relevance)

• The challenge:

– Distillation of knowledge from such a large amount of un/semi-structured information is an extremely difficult task

– The contents of today’s Web are perfectly suitable for human consumption, but remain hardly accessible to machines

Page 22: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

How to Use Wikipedia for Your Own Research


Page 23: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Structured Storage

[Diagram: two relational schemas (Schema 1, Schema 2) covering pages, categories, links, paragraphs, redirect pages, queries, and a document collection (TREC-X)]

Page 24: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Partial Diagram of Wikipedia’s Structured Meta-Data

[Created by Gilad Katz]

Page 25: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Wikipedia Download

• Client apps: XOWA, WikiTaxi, WikiReader, …

• 16 offline tools for downloading; ~53 GB of disk space

• Page views download – size and structure, ~50 GB per hour

• EnwikiContentSource

• Wikipedia Miner (Milne and Witten) [An open source toolkit for mining Wikipedia, 2012]

• XiaoxiaoLi/getWikipediaMetaData

Page 26: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Wikipedia Download

Google for: Wikipedia dump files download
https://dumps.wikimedia.org/enwiki/

Torrent: https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki

Use your own or others’ code to get plain text!

Page 27: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

DBPedia

• ~4.9 million articles, 4.2 million of which are classified into a consistent ontology

• Persons, places, organizations, diseases

• An effort to transform the knowledge of Wikipedia into a “tabular-like” format

• Sophisticated database-query language

• Open Data, Linked Open Data

Page 28: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Search Related Tasks

Session 1


Page 29: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Search Engine Basic Structure

[Diagram: crawl control and crawlers feed a page repository; the indexer and collection analysis build the indexes; the query engine and ranking serve queries and return results]

Page 30: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Operations Agenda

• Query Expansion

• Cross Language Information Retrieval

• Entity Search

• Query Performance Prediction


Page 31: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Expansion


Page 32: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

How to Describe the Unknown?

Meno: “And how will you enquire, Socrates, into that which you do not know? What will you put forth as the subject of enquiry? And if you find what you want, how will you ever know that this is the thing which you did not know?”

Plato, written c. 380 BCE

Page 33: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

How to Describe the Unknown?

Automatic QE – a process where the user’s original query is augmented by new features with similar meaning.

1. What you know and what you wish to know
2. The initial vague query vs. concrete topics and terminology

The average length of an initial query at prominent search engines was 2.4 terms in 2001 and 2.08 in 2009, and is about 3.1 nowadays (and growing…)

[Amanda Spink, Dietmar Wolfram, Major B. J. Jansen, Tefko Saracevic (2001)] “Searching the web: The public and their queries”.

[Taghavi, Wills, Patel (2011)] An analysis of web proxy logs with query distribution pattern approach for search engines

Page 34: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Wikipedia-Based Query Expansion

• Wikipedia is rich, highly interconnected, and domain independent

“The fourth generation iPad (originally marketed as iPad with Retina display, retrospectively marketed as the iPad 4) is a tablet computer produced and marketed by Apple Inc.” (Wikipedia)

Page 35: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Thesaurus-Based QE

Initial query → Wikipedia-based thesaurus → query augmented by new features with similar meaning

Knowledge-based search engine powered by Wikipedia. [Milne, Witten and Nichols] (2007)

Page 36: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Semantic relatedness is quantified. Consider “Jackson”.

• Relatedness: co-occurrence statistics of terms and of links (ESA as an alternative)

• Synonyms: redirect pages

• No NER process needed!

• Relevant to a particular document collection

• Manual definition vs. automatic generation

KORU [Milne, Witten and Nichols] (2007)

Page 37: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Suggestion

query = obama white house

Task: Given a query, produce a ranked list of concepts from Wikipedia which are mentioned or meant in the query

Quiz Time!


Page 38: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Learning Semantic Query Suggestions. [Meij, E., M. Bron, L. Hollink, B. Huurnink and M. De Rijke] (2009)

query = obama white house

Correct concepts: Barack Obama, White House

1. Candidate Generation

Use some ranking approach (e.g., language modeling) to score the concepts (articles) in Wikipedia, where each n-gram is considered as a query in its turn

Page 39: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

2. Candidate Selection

• Supervised machine learning approach

• Input: a set of labeled examples (query-to-concept mappings)

• Types of features:

– N-gram
– Concept
– N-gram/concept combination
– Current search history

Example features for a concept c:

– # of concepts linking to c
– # of concepts linked from c
– # of associated categories
– # of redirect pages to c
– Importance of c in the query (TF-IDF of Q in c)
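A minimal sketch (not the authors' code) of how such features could be computed for a candidate concept c; the WIKI dictionary is a hypothetical stand-in for real Wikipedia link, redirect, and category metadata:

    # Toy feature extraction for query-to-concept mapping.
    from math import log

    WIKI = {  # hypothetical per-concept metadata
        "Barack Obama": {"inlinks": 12000, "outlinks": 350, "redirects": 40,
                         "categories": 25},
        "White House": {"inlinks": 6000, "outlinks": 200, "redirects": 12,
                        "categories": 14},
    }

    def ngrams(tokens, n):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def concept_features(query, concept):
        meta = WIKI[concept]
        toks = query.lower().split()
        # n-gram feature: does some query n-gram equal the concept title?
        title_match = any(g == concept.lower()
                          for n in range(1, len(toks) + 1)
                          for g in ngrams(toks, n))
        return {
            "title_match": title_match,
            "log_inlinks": log(1 + meta["inlinks"]),    # concepts linking to c
            "log_outlinks": log(1 + meta["outlinks"]),  # concepts linked from c
            "redirects": meta["redirects"],             # redirect pages to c
            "categories": meta["categories"],           # associated categories
        }

    print(concept_features("obama white house", "White House"))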


Page 40: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Cross-Language Information Retrieval (CLIR)


Page 41: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Machine Translation First: أهلاً (“Hello”)


Page 42: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

CLIR Task

Query in a source language → query in a target language

Collection translation is not scalable! The solution: translate the query.

Page 43: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

WikiTranslate: Query Translation for Cross-lingual Information Retrieval using only Wikipedia [D. Nguyen, A. Overwijk, C. Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong] (2008)

Generating Translation Candidates

• Stage 1: Mapping the query to Wikipedia concepts

– Full query search (“obama white house”)
– Partial query search (n-grams)
– Sole n-grams in links of the top retrieved documents (“house”)
– Whole query with weighted n-grams (“obama white house”)

• Possible to combine search in different fields within a document:

– (title: obama) (content: white house) – to avoid missing terms

Page 44: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Creating the Final Translated Query

Stage 2: Creating the final expanded query; terms are weighted according to Stage 1 and an analysis of the Wikipedia pages:

• Concept translation from cross-lingual links

• Try synonyms and spelling variants

• Translated term weighting: concepts obtained from the whole-query search are more important

Obama White House →

Weißes Haus ^ 1.0
Barack Obama ^ 0.5
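A rough sketch of the two stages under toy data; LANGLINKS is a hypothetical stand-in for Wikipedia's cross-language link table (English title to German title):

    # Map query n-grams to concepts, then follow cross-language links.
    LANGLINKS = {
        "White House": "Weißes Haus",
        "Barack Obama": "Barack Obama",
    }

    def translate_query(query):
        q = " ".join(query.lower().split())
        weighted = {}
        for title, target in LANGLINKS.items():
            t = title.lower()
            if t in q:
                # whole-query concept matches outrank partial n-gram matches
                weight = 1.0 if t == q else 0.5
                weighted[target] = max(weighted.get(target, 0.0), weight)
        return weighted

    print(translate_query("obama white house"))  # {'Weißes Haus': 0.5}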

Page 45: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Entity Search


Page 46: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Entity Search

Retrieving a set of entities in response to a user’s query

query = “United States presidents”


Page 47: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Ranking very many typed entities on Wikipedia [Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, and Giuseppe Attardi] (2007)

Passages → entities

Entity ranking: graph centrality measures (degree), combined with inverse entity frequency

Page 48: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Category-Based Semantic Distance Between Entities

A ranking framework for entity oriented search using Markov random fields. [Raviv, H., D. Carmel and O. Kurland] (2012)

• Utilizing the Wikipedia category structure

• Query categories (QC)

• Entity categories (EC)

• Distance(QC, EC):

– If QC and EC have a common category, then distance = 0
– Else: distance = minimal path length

Page 49: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Category-Based Distance Example

Query categories: novels, books

If the entity category is novels, then distance = 0

If the entity category is books by Paul Auster, then distance = 2
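A minimal sketch of this distance, assuming PARENTS is a small fragment of Wikipedia's category graph; it reproduces the example above:

    # BFS from the entity's categories up the category graph until a query
    # category is reached; the number of hops is the distance.
    from collections import deque

    PARENTS = {  # hypothetical child-category -> parent-categories edges
        "books by Paul Auster": ["books by author"],
        "books by author": ["books"],
        "novels": ["books"],
    }

    def category_distance(entity_cats, query_cats):
        if set(entity_cats) & set(query_cats):
            return 0  # a common category means distance zero
        frontier = deque((c, 0) for c in entity_cats)
        seen = set(entity_cats)
        while frontier:
            cat, d = frontier.popleft()
            for parent in PARENTS.get(cat, []):
                if parent in query_cats:
                    return d + 1  # minimal path length
                if parent not in seen:
                    seen.add(parent)
                    frontier.append((parent, d + 1))
        return float("inf")

    print(category_distance(["novels"], ["novels", "books"]))                # 0
    print(category_distance(["books by Paul Auster"], ["novels", "books"]))  # 2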


Page 50: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

The INEX Entity Ranking Task

• The goal of the track is to evaluate how well systems can rank entities in response to a query

• Entities are assumed to correspond to Wikipedia entries

• Used category, association, and link structure

• Took place in 2007, 2008, 2009

Overview of the INEX 2007 Entity Ranking Track [Arjen P. de Vries, Anne-Marie Vercoustre, James A. Thom, Nick Craswell, Mounia Lalmas]

Page 51: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Performance Prediction

“The prophecy was given to the infants and the fools”


Page 52: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Which Query is Better?


Page 53: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Benefits of Estimating the Query Difficulty

(Reproduced from a tutorial: David Carmel and Oren Kurland. Query performance prediction for IR. SIGIR 2012)

• Feedback to users: the user can rephrase a difficult query

• Feedback to the search engine: choose an alternative retrieval strategy

• Feedback to the system administrator: identify missing content

• For IR applications: federated search over different datasets

Page 54: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Performance Prediction

Query1 = “obama white house”    Query2 = “weather in Israel”

Prediction mechanism → prediction value. Is Q1 > Q2?

Example predictor: QueryLength(Q1) = 3, i.e., query.split().length()
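As a toy illustration of the setting (not a predictor advocated here), the simplest pre-retrieval predictor is query length:

    # Each query gets a score; a good predictor induces the same ranking of
    # queries as their actual retrieval effectiveness.
    def query_length(query):
        return len(query.split())

    queries = ["obama white house", "weather in Israel"]
    print({q: query_length(q) for q in queries})
    # both score 3 -- too crude to decide whether Q1 > Q2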

Page 55: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

[Diagram: predictors’ values and TREC collections (topics, relevance judgments) feed a regression framework that estimates retrieval effectiveness (MAP)]

Page 56: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Performance Prediction in Ad-Hoc Retrieval

Estimating query effectiveness when relevance judgments are not available

Page 57: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Performance Prediction

• Pre-retrieval prediction

– Operates prior to retrieval time
– Analyzes the query and the corpus
– Computationally efficient

• Post-retrieval prediction

– Analyzes the result list of the most highly ranked documents
– Superior performance

Page 58: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

New Prediction Type? Absolute Query Difficulty

Regardless of corpus, but with external knowledge:

1. Corpus independent!
2. Information induced from Wikipedia
3. Advantage for non-cooperative search, where corpus-based information is not available

Wikipedia-based query performance prediction. [Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] (2014)

Page 59: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Wikipedia Properties

• Title
• Content
• Links
• Categories

Wikipedia-based query performance prediction. [Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] (2014)

Page 60: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Notational Conventions

A page p is associated with a set of terms if its title contains at least one of the terms in the set (soft mapping).

q = “Barack Obama White House” → its associated pages

Maximal exact match length (MEML) = 2. The set of pages for which the MEML holds is denoted M_MEML.
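A small sketch of computing the MEML, assuming TITLES is a lowercased set of Wikipedia page titles:

    TITLES = {"barack obama", "white house", "obama", "house"}

    def meml(query):
        toks = query.lower().split()
        for n in range(len(toks), 0, -1):  # try the longest subqueries first
            for i in range(len(toks) - n + 1):
                if " ".join(toks[i:i + n]) in TITLES:
                    return n  # length of the longest exactly matching subquery
        return 0

    print(meml("Barack Obama White House"))  # 2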

Page 61: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Titles (measuring queries)

Size of a subquery which has an exact match:

1. Maximum
2. Average

Quiz time! q = “Barack Obama White House”

Maximum = 2, Average = 2

Page 62: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Titles & Content (measuring pages)

• Number of pages associated with a subquery (fixed length = 1, 2, 3): sum, average, standard deviation (the “scope” of q in Wikipedia)

• The average length (# of terms) of the pages that are an exact match or a maximal exact match

Page 63: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Examples

Query                         | Max subquery match | # pages with one query term in title | # pages with two query terms in title
Horse Leg                     | 1 | 7885  | 3
Poliomyelitis and Post-Polio  | 2 | 6605  | 8
Hubble Telescope Achievements | 2 | 460   | 4
Endangered Species (Mammals)  | 1 | 3481  | 96
Most Dangerous Vehicles       | 1 | 3978  | 6
African Civilian Deaths       | 2 | 23381 | 14
New Hydroelectric Projects    | 1 | 93858 | 33
Implant Dentistry             | 1 | 268   | 0
Rap and Crime                 | 2 | 3826  | 1
Radio Waves and Brain Cancer  | 2 | 15460 | 33

Page 64: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Links and Categories

• Overall # of links that contain at least one of the query’s terms in their anchor text

• # of categories that appear in at least one of the pages associated with a subquery

• # of in/outgoing links for the Wikipedia pages in M_MEML: average, standard deviation

Page 65: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Links for Query Coherency Prediction

# of links from pages associated with a single-term subquery that point to pages associated with another subquery: maximum, average, standard deviation

q = “Barack Obama White House”

Page 66: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Examples

Query                         | Avg # pages with at least one query term in a link | Std. dev. of # categories with one query term in title | Link-based coherency (average)
Horse Leg                     | 1821.5   | 34.7211  | 1731.5
Poliomyelitis and Post-Polio  | 302      | 11.6409  | 1120.333
Hubble Telescope Achievements | 152.3333 | 11.19255 | 55.66667
Endangered Species (Mammals)  | 5753.333 | 64.60477 | 686
Most Dangerous Vehicles       | 376.6667 | 13.4316  | 396.3333
African Civilian Deaths       | 1673.333 | 18.53009 | 4172.333
New Hydroelectric Projects    | 1046     | 50.96217 | 121477.3
Implant Dentistry             | 209.5    | 13.45904 | 10.5
Rap and Crime                 | 1946     | 35.30528 | 661.5
Radio Waves and Brain Cancer  | 1395.5   | 24.81557 | 2202.5

Page 67: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Query Performance Prediction – Summary

• Absolute query difficulty, corpus independent

• The # of pages containing a subquery in the title is the most effective predictor

• Coherency is the second most effective

• Integration with state-of-the-art predictors leads to superior performance (Wikipedia-based clarity score)

Page 68: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

“The prophecy was given to the infants and the fools”

Query-Performance Prediction: Setting the Expectations Straight [Fiana Raiber and Oren Kurland] (2014)

Quiz time! Storming of the Bastille!

Page 69: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Exploiting Wikipedia for Sentiment Analysis

Session 2


Page 70: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Introduction

• Sentiment analysis or opinion mining

“The location is indeed perfect, across the road from Hyde Park and Kensington Palace. The building itself is charming, … The room, which was an upgrade, was ridiculously small. There was a persistent unpleasant smell. The bathroom sink didn’t drain properly. The TV didn’t work.”

“The location is indeed perfect”

Page 71: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Leveraging with Language Understanding

• Since they leave your door wide open when they come clean your room , the mosquitoes get in your room and they buzz around at night

• There were a few mosquitoes , but nothing that a little bug repellant could not solve

• It seems like people on the first floor had more issues with mosquitoes

• Also there was absolutely no issue with mosquitoes on the upper floors


Page 72: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Challenge – Common Sense


Page 73: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Wikipedia as a Knowledgebase


Page 74: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Page 75: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Concepts

– “We will go shopping next week”

• Relying on ontologies or semantic networks

• Using concepts steps away from blindly using keywords and word occurrences

[Cambria 2013: An introduction to concept-level sentiment analysis]

[Cambria et al., 2014]

Page 76: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Semantic Relatedness – ESA

• Concept representation (training)

Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis [Gabrilovich & Markovitch, 2007]

Page 77: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Document Representation

[Diagram: a text fragment becomes a word vector (w1, w2, w3, …); each word is looked up in a weighted inverted index mapping words to concepts (c1, c2, c3, …) with weights Weight(cj); summing over the words yields the fragment’s concept vector]
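A compact sketch of this pipeline under toy data; CONCEPT_TEXTS stands in for real Wikipedia articles:

    # ESA in miniature: build a word -> {concept: tf-idf} inverted index,
    # sum word vectors into a concept vector, compare vectors by cosine.
    from collections import Counter, defaultdict
    from math import log, sqrt

    CONCEPT_TEXTS = {
        "Jaguar (animal)": "jaguar cat predator jungle",
        "Jaguar Cars": "jaguar car engine british",
    }

    def build_index(concepts):
        index = defaultdict(dict)
        df = Counter(w for t in concepts.values() for w in set(t.split()))
        for concept, text in concepts.items():
            for w, f in Counter(text.split()).items():
                index[w][concept] = f * log(len(concepts) / df[w])
        return index

    def concept_vector(index, text):
        vec = defaultdict(float)
        for w in text.lower().split():
            for concept, weight in index.get(w, {}).items():
                vec[concept] += weight
        return vec

    def cosine(u, v):
        dot = sum(u[c] * v[c] for c in u if c in v)
        norm = sqrt(sum(x * x for x in u.values())) * \
               sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    idx = build_index(CONCEPT_TEXTS)
    print(cosine(concept_vector(idx, "big cat in the jungle"),
                 concept_vector(idx, "fast car engine")))  # 0.0: senses differ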

Page 78: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Similarity


Page 79: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Word Sense Disambiguation

plant


Page 80: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Wikipedia as a Sense-Tagged Corpus

• Use hyperlinked concepts as sense annotations

• Extract senses and paragraphs of mentions of the ambiguous word in Wikipedia articles

• Learn a classification model

Using Wikipedia for Automatic Word Sense Disambiguation [Mihalcea, 2007]

Page 81: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Annotated Text

• In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston.

• It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar (music)|bar]].

• Vehicles of this type may contain expensive audio players, televisions, video players, and [[bar (counter)|bar]]s, often with refrigerators.

• Jenga is a popular beer in the [[bar (establishment)|bar]]s of Thailand.

• This is a disturbance on the water surface of a river or estuary, often caused by the presence of a [[bar (landform)|bar]] or dune on the riverbed.

Page 82: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Example of Annotated Text

• In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston.

• Feature examples:

– The current word and its POS
– Surrounding words and their POS
– The verb and noun in the vicinity
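A small sketch of harvesting such (sense, context) training pairs from the wiki markup; TEXT is a fragment of the annotated sentences above:

    # Every piped link [[bar (law)|bar]] yields one labeled occurrence of
    # the ambiguous word together with its local context.
    import re

    TEXT = ("In 1834, Sumner was admitted to the [[bar (law)|bar]] at the "
            "age of twenty-three. It is danced in 3/4 time, with the couple "
            "turning approx. 180 degrees every [[bar (music)|bar]].")

    def sense_examples(text, word):
        pattern = r"\[\[([^|\]]+)\|" + re.escape(word) + r"\]\]"
        examples = []
        for m in re.finditer(pattern, text):
            sense = m.group(1)                     # e.g. "bar (law)"
            left = text[:m.start()].split()[-3:]   # context-window features
            right = text[m.end():].split()[:3]
            examples.append((sense, left, right))
        return examples

    for sense, left, right in sense_examples(TEXT, "bar"):
        print(sense, left, right)
    # These pairs can feed any standard supervised classifier.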

Page 83: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Opinion Target Detection


Page 84: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Challenge – Granularity Level

One can look at this review from:

– Document level, i.e., is this review + or −?
– Sentence level, i.e., is each sentence + or −?
– Entity and feature/aspect level


Page 85: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Sentiment Lexicon

• Sentiment words are often the dominating factor for sentiment analysis (Liu, 2012)

– good, excellent, bad, terrible

• Sentiment lexicon holds a score for each word representing the degree of its sentiment

Positive (+): good, excellent, friendly, beautiful

Negative (−): bad, terrible, ugly, difficult

Page 86: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Sentiment Lexicon

• General lexicon:

– There is no general sentiment lexicon that is optimal in every domain
– The sentiment of some terms depends on the context
– Poor coverage

“The device was small and handy.”

“The waiter brought the food on time, but the portion was very small.”

Page 87: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Yet Another Challenge: Polarity Ambiguity


Page 88: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Remedy: Identify Opinion Targets

Old wine or warm beer: Target-specific sentiment analysis of adjectives [Fahrni & Klenner, 2008]

Page 89: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

• If “cold coca cola” is positive then “cold coca cola cherry” is positive as well

[Diagram: adjectives (cold, cool, warm, old, expensive) paired with targets (Coca Cola, Sushi, Pizza), and the same adjectives transferred to the unseen target “Coca Cola cherry”]

Page 90: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Example (2): Product Attribute Detection

• Discover which attributes people express opinions about

• Identify all words that are included in Wikipedia titles

Domain Independent Model for Product Attribute Extraction from User Reviews using Wikipedia. [Kovelamudi, Ramalingam, Sood & Varma, 2011]

“Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz”

Page 91: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

WSD applied to the review:

“Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz now fer like a third the price I paid.. heheh.. oh well....the fact that I didn’t wait a year er so to buy a bigger model for half the price.. most likely from a different store.. ..not namin any namez th0..”

Page 92: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Semantic relatedness features for candidate attribute words (e.g., “model”, “price”):

• Average relatedness of word $W_i$ to the other $k$ candidate words: $\frac{1}{k}\sum_{j=1,\,j\neq i}^{k}\mathrm{relatedness}(W_i, W_j)$

• Relative frequency of $W_i$ in the review collection $P$: $\frac{\mathrm{count}(W_i, P)}{\mathrm{count}(P)}$

Page 93: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Lexicon Construction (unsupervised)

• Label propagation

• Identify pairs of adjectives based on a set of constraints

• Infer from known adjectives

Example: from “the room was good and wide”, with good a known positive seed (and bad a negative one), the unknown adjective wide is inferred to be positive (polarity = 0.97).

Unsupervised Common-Sense Knowledge Enrichment for Domain-Specific Sentiment Analysis [Ofek, Rokach, Poria, Cambria, Hussein, Shabtai, 2015]
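A minimal sketch of the conjunction heuristic behind such propagation (adjectives joined by “and” tend to share polarity); the 0.97 damping simply echoes the example above:

    import re

    SEEDS = {"good": 1.0, "bad": -1.0}  # adjectives with known polarity

    def propagate(sentences):
        scores = dict(SEEDS)
        for s in sentences:
            for a, b in re.findall(r"(\w+) and (\w+)", s.lower()):
                for known, new in ((a, b), (b, a)):
                    if known in scores and new not in scores:
                        scores[new] = 0.97 * scores[known]  # damped copy
        return scores

    print(propagate(["the room was good and wide"]))
    # {'good': 1.0, 'bad': -1.0, 'wide': 0.97}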

Page 94: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Computing Polarity


Page 95: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

WikiSent: Sentiment Analysis of Movie Reviews

Wikisent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia [Mukherjee & Bhattacharyya, 2012]

Page 96: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Feature Types

• Crew

• Cast

• Specific domain nouns from the text content and the plot

– wizard, witch, magic, wand, death-eater, power

• Movie domain specific terms

– Movie, Staffing, Casting, Writing, Theory, Rewriting, Screenplay, Format, Treatments, Scriptments, Synopsis


Page 97: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Retrieve Relevant Opinionated Text

• Rank sentences according to the participating entities:

1. Title, crew
2. Movie-domain-specific terms
3. Plot

• Sentiment classification: weakly supervised; words’ polarity is identified from general lexicons

• Subjectivity is scored as a weighted combination of the three match counts above: $\alpha\sum(\cdot) + \beta\sum(\cdot) - \gamma\sum(\cdot)$

Page 98: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Classify Blog Sentiment

• Use verb and adjective categories

• Adjectives:

– Positive
– Negative

• Verbs:

– Positive verb classes: positive mental affecting, approving, praising
– Negative verb classes: abusing, accusing, arguing, doubting, negative mental affecting

Using verbs and adjectives to automatically classify blog sentiment [Chesley, Vincent, Xu, & Srihari, 2006]

Page 99: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Expanding Adjectives (cont.)

• Query Wiktionary

• Assumption: an adjective’s polarity is reflected by the words that define it

• Maintain a set of adjectives with known polarity

• Count mentions of known adjectives in the definition to derive the new adjective’s polarity
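A sketch of the counting idea, assuming DEFINITIONS holds Wiktionary glosses fetched elsewhere:

    POS = {"good", "pleasant", "clean"}   # known positive adjectives
    NEG = {"bad", "unpleasant", "dirty"}  # known negative adjectives
    DEFINITIONS = {"filthy": "very dirty or unpleasant"}

    def polarity(adjective):
        words = set(DEFINITIONS[adjective].lower().split())
        score = len(words & POS) - len(words & NEG)
        return "positive" if score > 0 else "negative" if score < 0 else "unknown"

    print(polarity("filthy"))  # negative (two known-negative defining words)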

Page 100: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Page 101: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Recommender Systems

Session 3


Page 102: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Recommender Systems

RS are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly. (Xiao & Benbasat 2007)

• Different system designs and algorithms:

– Based on availability of data
– Domain characteristics

Page 103: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Page 104: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Content-Based Recommendation: A General Vector-Space Approach

[Diagram: relevant content flows through matching and thresholding; user feedback drives profile learning and threshold learning]

Page 105: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Challenge: Dull Content

• Content does not contain enough information to distinguish items the user likes from items the user does not like

– Result: specificity, i.e., more of the same
– Vocabulary mismatch, limited aspects in the content to distinguish relevant items, synonymy, polysemy
– Result: bad recommendations!

Page 106: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Remedy: Enriching Background Knowledge by Infusion of External Sources

• Movie recommendations

• Three external sources:

– Dictionary: (controlled) extracts of lemma descriptions (linguistic)
– Wikipedia: pages related to movies are transformed into semantic vectors using matrix factorization
– User tags: on some sites people can add tags to movies

Knowledge Infusion into Movies Content Based Recommender Systems [Semeraro, Lops and Basile 2009]

Page 107: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

• The data combined from all sources is represented as a graph.

• The set of terms that describe an item is extended using a spreading-activation model that connects terms based on semantic relations.

Knowledge Infusion into Movies Content Based Recommender Systems [Semeraro, Lops and Basile 2009]

Page 108: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Process Example: The Shining

Keyword search: axe, murder, paranormal, hotel, winter

Spreading activation over external knowledge adds: perceptions, killer, psychiatry

Search then also retrieves: “Carrie”, “The Silence of the Lambs”

Page 109: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Tweets Content-Based Recommendations

• Goal: re-rank tweet messages based on similarity to the user’s content profile

• Method: the user interest profile is represented as two vectors: concepts from Wikipedia, and affinity with other users

• The user’s profile is expanded by a random walk on the Wikipedia concept graph, utilizing the inter-links between Wikipedia articles

[Lu, Lam & Zhang, 2012] Twitter User Modeling and Tweets Recommendation based on Wikipedia Concept Graph

Page 110: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Algorithm Steps

1. Map a Twitter message to a set of concepts employing Explicit Semantic Analysis (ESA).

2. Generate a user profile from two vectors: a concept vector representing the topics the user is interested in, and a vector representing the affinity with other users.

3. To get related concepts, apply a random walk on the Wikipedia concept graph.

4. Represent the user profile and the tweet as sets of weighted Wikipedia concepts.

5. Apply cosine similarity to compute the score between the profile and the tweet messages.
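A short sketch of step 3, expanding a seed concept profile with a personalized-PageRank-style random walk; LINKS is a hypothetical fragment of the inter-article link graph:

    LINKS = {
        "Barack Obama": ["White House", "United States"],
        "White House": ["Washington, D.C."],
        "United States": ["Washington, D.C."],
        "Washington, D.C.": [],
    }

    def expand_profile(seed, alpha=0.85, steps=20):
        scores = dict(seed)
        for _ in range(steps):
            # restart mass goes back to the seed concepts
            nxt = {c: (1 - alpha) * seed.get(c, 0.0) for c in LINKS}
            for c, s in scores.items():
                out = LINKS.get(c, [])
                for n in out:  # spread the rest along outgoing links
                    nxt[n] += alpha * s / len(out)
            scores = nxt
        return scores

    print(expand_profile({"Barack Obama": 1.0}))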

Page 111: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Results


Page 112: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Collaborative Filtering

Description: The method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future.

Main approaches:

• kNN – nearest neighbors

• SVD – matrix factorization

Page 113: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Collaborative Filtering: The Idea

Trying to predict the opinion the user will have on the different items, and to recommend the “best” items to each user based on the user’s previous likings and the opinions of other like-minded users.

[Diagram: users and items with positive/negative ratings]

Page 114: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Collaborative Filtering Rating Matrix

The ratings (or events) of users and items are represented in a matrix. All CF methods are based on such a rating matrix.

[Sample matrix: rows are the users U1…Um, columns are the items I1…In; a cell holds a rating r where one exists]

Page 115: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Collaborative Filtering, Approach 1: Nearest Neighbors

“People who liked this also liked…”

User-to-user: Recommendations are made by finding users with similar tastes. Jane and Tim both liked Item 2 and disliked Item 3; it seems they might have similar taste, which suggests that in general Jane agrees with Tim. This makes Item 1 a good recommendation for Tim.

Item-to-item: Recommendations are made by finding items that have similar appeal to many users. Tom and Sandra liked both Item 1 and Item 4. That suggests that, in general, people who liked Item 4 will also like Item 1, so Item 1 will be recommended to Tim. This approach is scalable to millions of users and millions of items.

Page 116: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Some Math…

Similarity (Pearson correlation between the active user a and another user u, over the m co-rated items):

$w_{a,u} = \frac{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)^2}\sqrt{\sum_{i=1}^{m}(r_{u,i}-\bar{r}_u)^2}}$

Prediction (similarity-weighted average of the n neighbors’ mean-centered ratings):

$p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n}(r_{u,i}-\bar{r}_u)\, w_{a,u}}{\sum_{u=1}^{n}|w_{a,u}|}$
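A direct implementation of the two formulas on a toy rating structure (user -> {item: rating}; absent entries mean unrated):

    from math import sqrt

    R = {
        "a":  {"i1": 5, "i2": 3},
        "u1": {"i1": 4, "i2": 2, "i3": 5},
        "u2": {"i1": 1, "i2": 5, "i3": 2},
    }

    def r_mean(u):
        return sum(R[u].values()) / len(R[u])

    def pearson(a, u):
        common = set(R[a]) & set(R[u])  # the m co-rated items
        if not common:
            return 0.0
        ma, mu = r_mean(a), r_mean(u)
        num = sum((R[a][i] - ma) * (R[u][i] - mu) for i in common)
        den = sqrt(sum((R[a][i] - ma) ** 2 for i in common)) * \
              sqrt(sum((R[u][i] - mu) ** 2 for i in common))
        return num / den if den else 0.0

    def predict(a, item):
        neighbors = [u for u in R if u != a and item in R[u]]
        w = {u: pearson(a, u) for u in neighbors}
        norm = sum(abs(x) for x in w.values())
        if norm == 0:
            return r_mean(a)  # no informative neighbor: fall back to the mean
        return r_mean(a) + sum((R[u][item] - r_mean(u)) * w[u]
                               for u in neighbors) / norm

    print(round(predict("a", "i3"), 2))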

Page 117: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Computing the Item-to-Item Similarity

We must have:

• For each customer u_i, the list of products bought by u_i

• For each product p_j, the list of users that bought it

Amazon.com Recommendations: Item-to-Item Collaborative Filtering [Greg Linden, Brent Smith, and Jeremy York], 2003, IEEE Internet Computing

Page 118: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

The Main Challenge – Lack of Data

• Sparseness

• Long Tail

– many items in the Long Tail have only a few ratings

• Cold Start

– System cannot draw any inferences for users or items about which it has not yet gathered sufficient data


Page 119: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

External Sources as a Remedy

[Diagram: a recommendation engine consumes user data and item data, enriched by external sources, and produces recommendations]

External data about users and items, from: the WWW, social networks, other systems, other users’ devices, and Wikipedia

Page 120: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Computing Similarities Using Wikipedia for Sparse Matrices

• Utilize knowledge from Wikipedia to infer the similarities between items/users

• Systems differ in what Wikipedia data is used and how it is used

Similar????


Page 121: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Example

Step 1: Map items to Wikipedia pages

– Generate several variations of each item’s (movie’s) name
– Compare the generated names with corresponding page titles in Wikipedia
– Choose the page with the largest number of categories that contain the word “film” (e.g., “Films shot in Vancouver”)

• Using this technique, 1512 of the 1682 items (89.8%) contained in the MovieLens database were matched

Using Wikipedia to Boost Collaborative Filtering Techniques, RecSys 2011 [Katz, Ofek, Shapira, Rokach, Shani, 2011]
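A simplified sketch of Step 1 (the real variation rules are the paper's); WIKI_PAGES is a hypothetical title-to-categories map:

    WIKI_PAGES = {
        "Toy Story": ["1995 films", "Pixar films"],
        "Toy Story (film)": [],  # e.g. a near-empty redirect page
    }

    def variants(movielens_title):
        # "Toy Story (1995)" -> name and year variants of the title
        name = movielens_title.rsplit("(", 1)[0].strip()
        year = movielens_title[-5:-1]
        return [name, f"{name} (film)", f"{name} ({year} film)"]

    def match(movielens_title):
        candidates = [v for v in variants(movielens_title) if v in WIKI_PAGES]
        # prefer the page whose categories mention "film" most often
        return max(candidates,
                   key=lambda t: sum("film" in c.lower() for c in WIKI_PAGES[t]),
                   default=None)

    print(match("Toy Story (1995)"))  # Toy Story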


Page 122: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Page 123: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Step 2: Use Wikipedia information to compute the similarity between items

– Text similarity: calculate the cosine similarity between the items’ pages in Wikipedia
– Category similarity: count mutual categories
– Link similarity: count mutual outgoing links
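A minimal sketch of the three measures, assuming each item carries its Wikipedia page text, category set, and outgoing-link set:

    from collections import Counter
    from math import sqrt

    def text_similarity(text1, text2):  # cosine over term-count vectors
        v1, v2 = Counter(text1.lower().split()), Counter(text2.lower().split())
        dot = sum(v1[t] * v2[t] for t in v1)
        norm = sqrt(sum(c * c for c in v1.values())) * \
               sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    def category_similarity(cats1, cats2):  # mutual categories
        return len(set(cats1) & set(cats2))

    def link_similarity(links1, links2):  # mutual outgoing links
        return len(set(links1) & set(links2))

    print(text_similarity("horror film set in a hotel", "horror film at sea"))
    print(category_similarity({"1980 films", "Horror films"}, {"Horror films"}))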


Page 124: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Use Wikipedia to Generate “Artificial Ratings”

Step 3: Use the item-similarity matrix and each user’s actual ratings to generate additional, “artificial” ratings.

• i is the item for which we wish to generate a rating

• K is the set of items with actual ratings
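One plausible form of such a generated rating (the exact weighting scheme is the paper's) is a similarity-weighted average of the user's actual ratings over the most similar items in K:

    def artificial_rating(user_ratings, sims, i, top_k=10):
        # user_ratings: {item: actual rating}; sims: {(i, k): similarity}
        neighbors = sorted(user_ratings,
                           key=lambda k: sims.get((i, k), 0.0),
                           reverse=True)[:top_k]
        wsum = sum(sims.get((i, k), 0.0) for k in neighbors)
        if wsum == 0:
            return None  # nothing similar enough to infer from
        return sum(sims.get((i, k), 0.0) * user_ratings[k]
                   for k in neighbors) / wsum

    print(artificial_rating({"m1": 5, "m2": 2},
                            {("m3", "m1"): 0.9, ("m3", "m2"): 0.1}, "m3"))  # 4.7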


Page 125: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Use Wikipedia to Generate “Artificial Ratings”

Step 4: Add the artificial ratings to the user-item matrix

• Use an artificial rating only where there is no real value

• This matrix will be used for the collaborative filtering


Page 126: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Results

• The sparser the initial matrix, the greater the improvement


Page 127: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Results – Collaborative Filtering

[Chart: RMSE (0.9–1.25) vs. % of data sparsity (0.94–1.0) for the various methods: basic item-item, categories, links, text, IMDB text]

Page 128: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Comparison – Wikipedia and IMDB

[Chart: comparison of words per movie; number of movie descriptions vs. number of words in text, for Wikipedia and IMDB]

Page 129: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Yet another CF algorithm - SVD


Page 130: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Collaborative Filtering, Approach 2: Matrix Factorization

• MF is a decomposition of a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix.

• MF models users and items as vectors of latent features which produce the rating of the item by the user.

• With MF, a matrix is factored into a series of linear approximations that expose the underlying structure of the matrix.

• The goal is to uncover latent features that explain observed ratings.
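A bare-bones sketch of a latent-factor model trained by stochastic gradient descent on toy data (real systems add biases and tuned regularization):

    import random

    random.seed(0)
    ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u2", "i3", 1)]
    f, lr, reg = 2, 0.01, 0.02  # latent dimensions, learning rate, regularizer
    P = {u: [random.random() for _ in range(f)] for u, _, _ in ratings}
    Q = {i: [random.random() for _ in range(f)] for _, i, _ in ratings}

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    for _ in range(500):  # SGD over the observed ratings
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])
            for k in range(f):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)

    print(round(dot(P["u1"], Q["i1"]), 2))  # close to the observed rating of 5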

Page 131: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Latent Factor Models – Example

MF process: users & ratings → latent concepts or factors (via SVD)

MF reveals hidden connections and their strength: a hidden concept links users to ratings.

Page 132: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Latent Factor Models – Example

Users & ratings → latent concepts or factors

SVD revealed a movie this user might like! (the recommendation)

Page 133: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Latent Factor Models – Concept Space

Page 134: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial
Page 135: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Integrating Wikipedia into the SVD Process

• Learn parameters for both the actual and the “artificial” ratings

Page 136: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Using Wikipedia to Generate an Ontology for Recommender Systems


Page 137: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Example (1): Scholarly Content Recommendation

• Content recommendations for scholars, using an ontology to describe the scholar’s profile

• Challenge: singular reference ontologies lack sufficient ontological concepts and are unable to represent the scholars’ knowledge

A reference ontology for profiling scholars’ background knowledge in recommender systems, 2014 [Bahram, Roliana, Mohd, Nematbakhsh]

Page 138: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

• Building ontology-based profiles for modeling the background knowledge of scholars

• Build the ontology by merging a few sources

• Wikipedia is used both as a knowledge source and for merging ontologies (verifying the semantic similarity between two candidate terms)

Page 139: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Page 140: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Page 141: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Example (2): Tag Recommendation

• Challenge: “bad” tags (e.g., misspelled, irrelevant) lead to improper relationships among items and ineffective searches for topic information

Page 142: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Ontology-Based Tag Recommendation

• Integrate Wikipedia categories with WordNet synsets to create a topic ontology for tag recommendation

Effective Tag Recommendation System Based on Topic Ontology using Wikipedia and WordNet [Subramaniyaswamy and Chenthur Pandian, 2012]

Page 143: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial


Page 144: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Results


Page 145: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

We demonstrated how Wikipedia was used for:

• Ontology creation: RecSys, SA

• Semantic relatedness: RecSys, QE, QPP

• Synonym detection: CLIR, QE

• Relevance assessment: entity search, QPP

• Disambiguation: SA, QE

• Domain-specific knowledge acquisition: RecSys, SA

Page 146: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

What was not Covered

• Using behaviors (views, edits)

• Examining time effects

• More tasks, more methods (QA, advertising, …)

• Wikipedia weaknesses

• We only showed a few examples of Wikipedia’s power and its potential

Page 147: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Use Wikipedia, it is a treasure…


Page 148: Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial

Thank You

Complete references list:

http://vitiokm.wix.com/wikitutorial
