Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Ben Carterette and Praveen Chandar, Dept. of Computer and Information Science, University of Delaware, Newark, DE ( CIKM ’09 ). Date: 2010/05/03. Speaker: Lin, Yi-Jhen. Advisor: Dr. Koh, Jia-Ling.

Transcript of Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Page 1: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Ben Carterette and Praveen Chandar

Dept. of Computer and Information Science, University of Delaware

Newark, DE ( CIKM ’09 )

Date: 2010/05/03

Speaker: Lin, Yi-Jhen

Advisor: Dr. Koh, Jia-Ling

Page 2: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Agenda

- Introduction: Motivation, Goal
- Faceted Topic Retrieval: Task, Evaluation
- Faceted Topic Retrieval Models: 4 kinds of models
- Experiment & Results
- Conclusion

Page 3: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Introduction - Motivation

Modeling documents as independently relevant does not necessarily provide the optimal user experience.

Page 4: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Introduction - Motivation

Traditional evaluation measures would reward System1, since it has higher recall.

Actually, we prefer System2, since it provides more information.

System2 is better!

Page 5: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Introduction

Novelty and diversity become the new definition of relevance and the basis of new evaluation measures.

They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic.

We call this faceted topic retrieval!

Page 6: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Introduction - Goal

The faceted topic retrieval system must be able to find a small set of documents that covers all of the facets.

3 documents that cover 10 facets are preferable to 5 documents that cover 10 facets.

Page 7: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Faceted Topic Retrieval - Task

Define the task in terms of:

- Information need: a faceted topic retrieval information need is one that has a set of answers (facets) that are clearly delineated
- How that need is best satisfied: each answer is fully contained within at least one document

Page 8: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Faceted Topic Retrieval - Task

Information need

Facets (a set of answers):

- invest in next generation technologies
- increase use of renewable energy sources
- invest in renewable energy sources
- double ethanol in gas supply
- shift to biodiesel
- shift to coal

Page 9: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Faceted Topic Retrieval

A query: a short list of keywords

Result: a ranked list of documents D1, D2, ..., Dn that contain as many unique facets as possible.

Page 10: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Faceted Topic Retrieval - Evaluation

- S-recall
- S-precision
- Redundancy
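The slide's own formulas are not in the transcript. As an illustration of the first two measures, the sketch below follows the standard subtopic-retrieval definitions: S-recall at rank k is the fraction of all facets covered by the top k documents, and S-precision at recall r is the minimum rank an optimal ranking needs to reach r divided by the rank the system needs. Redundancy is left out because its definition is not recoverable here. The function names and toy data are assumptions, and the optimal rank is found by brute force, which is only practical for tiny examples.

```python
from itertools import combinations

def s_recall(ranking, doc_facets, n_facets, k):
    """Fraction of all facets covered by the top-k documents."""
    covered = set()
    for doc in ranking[:k]:
        covered |= doc_facets.get(doc, set())
    return len(covered) / n_facets

def min_rank(ranking, doc_facets, n_facets, r):
    """Smallest k at which the ranking reaches S-recall >= r (None if never)."""
    for k in range(1, len(ranking) + 1):
        if s_recall(ranking, doc_facets, n_facets, k) >= r:
            return k
    return None

def min_optimal_rank(doc_facets, n_facets, r):
    """Smallest document set reaching S-recall >= r, by exhaustive search."""
    docs = list(doc_facets)
    for size in range(1, len(docs) + 1):
        for subset in combinations(docs, size):
            if s_recall(list(subset), doc_facets, n_facets, size) >= r:
                return size
    return None

def s_precision(ranking, doc_facets, n_facets, r):
    """Optimal minimum rank divided by the system's minimum rank at recall r."""
    system = min_rank(ranking, doc_facets, n_facets, r)
    optimal = min_optimal_rank(doc_facets, n_facets, r)
    if system is None or optimal is None:
        return 0.0
    return optimal / system

# Hypothetical judgments: which facets each document contains.
doc_facets = {"D1": {1, 2}, "D2": {2}, "D3": {3, 4}}
ranking = ["D1", "D2", "D3"]

print(s_recall(ranking, doc_facets, 4, k=2))       # 0.5
print(s_precision(ranking, doc_facets, 4, r=1.0))  # 2/3: optimal needs 2 docs, system needs 3
```

Here D2 adds no new facet, so the system needs three documents where an ideal ranking (D1, D3) needs two.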

Page 11: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Evaluation – an example for S-recall and S-precision

Total: 10 facets (assume the facets in the documents do not overlap)

Page 12: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Evaluation – an example for Redundancy

Page 13: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Faceted topic retrieval models

4 kinds of models:

- MMR (Maximal Marginal Relevance)
- Probabilistic Interpretation of MMR
- Greedy Result Set Pruning
- A Probabilistic Set-Based Approach

Page 14: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

1. MMR
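The MMR formula shown on the slide is not in the transcript. As a rough sketch of the commonly cited formulation, each step picks the remaining document that maximizes λ·sim(D, Q) − (1 − λ)·max over already selected Dj of sim(D, Dj). The function name, the `lam` parameter, and the toy similarity values below are assumptions for illustration only.

```python
def mmr_rank(query_sim, doc_sim, lam=0.5, k=None):
    """Greedy MMR re-ranking.

    query_sim: dict mapping document id -> similarity to the query
    doc_sim:   function (doc_a, doc_b) -> similarity between two documents
    lam:       trade-off between relevance (1.0) and novelty (0.0)
    """
    remaining = list(query_sim)
    selected = []
    k = len(remaining) if k is None else min(k, len(remaining))
    while len(selected) < k:
        def score(d):
            # Penalize similarity to the most similar document already chosen.
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores: A and B are near-duplicates, C is different.
query_sim = {"A": 0.9, "B": 0.85, "C": 0.5}
pair_sim = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.10,
    frozenset(("B", "C")): 0.10,
}

def doc_sim(a, b):
    return pair_sim[frozenset((a, b))]

print(mmr_rank(query_sim, doc_sim, lam=0.5))  # ['A', 'C', 'B']
print(mmr_rank(query_sim, doc_sim, lam=1.0))  # ['A', 'B', 'C']
```

With `lam=1.0` the ranking reduces to plain relevance order; lowering it pushes the near-duplicate B below the less relevant but novel C.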

2. Probabilistic Interpretation of MMR

Let c1=0, c3=c4

Page 15: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

3. Greedy Result Set Pruning

- First, rank without considering novelty (in order of relevance)
- Second, step down the list of documents and prune documents with similarity greater than some threshold ϴ
- I.e., at rank i, remove any document Dj, j > i, with sim(Dj,Di) > ϴ
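The two steps above can be sketched in a few lines. This is a minimal illustration, assuming the relevance-ordered ranking and a pairwise similarity function are already available; the names `prune_ranking`, `sim`, and `theta`, and the toy similarity values, are hypothetical.

```python
def prune_ranking(ranking, sim, theta):
    """Walk down a relevance-ordered ranking; at each rank i, drop every
    lower-ranked document whose similarity to the document at i exceeds theta."""
    kept = list(ranking)
    i = 0
    while i < len(kept):
        current = kept[i]
        kept = kept[: i + 1] + [d for d in kept[i + 1:] if sim(d, current) <= theta]
        i += 1
    return kept

# Hypothetical pairwise similarities: A and B are near-duplicates.
pair_sim = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.10,
    frozenset(("B", "C")): 0.10,
}

def doc_sim(a, b):
    return pair_sim[frozenset((a, b))]

print(prune_ranking(["A", "B", "C"], doc_sim, theta=0.8))  # ['A', 'C']
```

Because pruned documents never reach rank i themselves, they cannot cause further removals, which matches the "step down the list" description.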

Page 16: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

4. A Probabilistic Set-Based Approach

- P(F ϵ D): the probability that D contains F
- P(Fj ϵ D): the probability that a facet Fj occurs in at least one document in a set D
- P(F ϵ D), for a set of facets F: the probability that all of the facets in F are captured by the documents D
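The formulas themselves are missing from the transcript. Assuming facet occurrences are independent across documents, and facets are independent of one another, the two probabilities described above would take the following form for n documents and m facets. Treat this as a reconstruction under those assumptions rather than the slide's exact notation.

```latex
P(F_j \in D) = 1 - \prod_{i=1}^{n} \bigl(1 - P(F_j \in D_i)\bigr)

P(F \in D) = \prod_{j=1}^{m} P(F_j \in D)
           = \prod_{j=1}^{m} \Bigl[ 1 - \prod_{i=1}^{n} \bigl(1 - P(F_j \in D_i)\bigr) \Bigr]
```

The first line is the complement of "no document in D contains Fj"; the second multiplies that coverage probability over all facets.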

Page 17: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

4. A Probabilistic Set-Based Approach

- 4.1 Hypothesizing Facets
- 4.2 Estimating Document-Facet Probabilities
- 4.3 Maximizing Likelihood

Page 18: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

4.1 Hypothesizing Facets

Two unsupervised probabilistic methods:

- Relevance modeling
- Topic modeling with LDA

Instead of extracting facets directly from any particular word or phrase, we build a “facet model” P(w|F).

Page 19: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

4.1 Hypothesizing Facets

Since we do not know the facet terms or the set of documents relevant to the facet, we will estimate them from the retrieved documents.

Obtain m models from the top m retrieved documents by taking each document along with its k nearest neighbors as the basis for a facet model.
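The grouping step can be pictured with the sketch below. It assumes documents are represented as term-count dictionaries and that nearest neighbors are found by cosine similarity; the transcript does not say which similarity is used for this step, so that choice, the function names, and the toy vectors are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-count dictionaries."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def facet_groups(ranked_docs, vectors, m, k):
    """For each of the top-m documents, return that document plus its
    k nearest neighbors among the retrieved documents."""
    groups = []
    for seed in ranked_docs[:m]:
        others = [d for d in ranked_docs if d != seed]
        others.sort(key=lambda d: cosine(vectors[seed], vectors[d]), reverse=True)
        groups.append([seed] + others[:k])
    return groups

# Hypothetical retrieved documents as bags of words.
vectors = {
    "D1": {"oil": 2, "ethanol": 1},
    "D2": {"ethanol": 2, "corn": 1},
    "D3": {"coal": 3},
    "D4": {"oil": 1, "coal": 1},
}
ranked = ["D1", "D2", "D3", "D4"]

print(facet_groups(ranked, vectors, m=2, k=1))  # [['D1', 'D4'], ['D2', 'D1']]
```

Each returned group would then serve as the document set from which one facet model P(w|Fj) is estimated.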

Page 20: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Relevance modeling

Estimate m “facet models” P(w|Fj) from a set of retrieved documents using the so-called RM2 approach, in which DFj is the set of documents relevant to facet Fj and fk are the facet terms.

Page 21: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Topic modeling with LDA

The probabilities P(w|Fj) and P(Fj) can be found through expectation maximization.

Page 22: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

4.2 Estimating Document-Facet Probabilities

Both the facet relevance model and the LDA model produce generation probabilities P(Di|Fj).

P(Di|Fj): the probability that sampling terms from the facet model Fj will produce document Di

Page 23: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

4.3 Maximizing Likelihood

- Define the likelihood function L(y)
- Constraint: K, the hypothesized minimum number of documents required to cover the facets
- Maximizing L(y) is an NP-hard problem
- Approximate solution: for each facet Fj, take the document Di with maximum [expression missing from the transcript]
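The quantity being maximized is cut off in the transcript. The sketch below assumes it is a per-facet document score (here a hypothetical matrix `p[doc][j]` of document-facet probabilities), picks the top-scoring document for each facet, and then evaluates the resulting set with an independence-based coverage probability. Both the scoring assumption and the example numbers are illustrative only.

```python
def greedy_select(p):
    """For each facet j, take the document with the highest p[doc][j];
    duplicates are skipped so the result is a set in selection order."""
    n_facets = len(next(iter(p.values())))
    selected = []
    for j in range(n_facets):
        best = max(p, key=lambda d: p[d][j])
        if best not in selected:
            selected.append(best)
    return selected

def coverage_probability(selected, p):
    """Probability that every facet appears in at least one selected document,
    assuming independence across documents and facets."""
    n_facets = len(next(iter(p.values())))
    total = 1.0
    for j in range(n_facets):
        miss = 1.0
        for d in selected:
            miss *= 1.0 - p[d][j]
        total *= 1.0 - miss
    return total

# Hypothetical document-facet probabilities for 3 documents and 3 facets.
p = {
    "D1": [0.9, 0.8, 0.1],
    "D2": [0.8, 0.7, 0.1],
    "D3": [0.1, 0.1, 0.6],
}

chosen = greedy_select(p)
print(chosen)                           # ['D1', 'D3']
print(coverage_probability(chosen, p))  # about 0.4776
```

This avoids searching over all document subsets, which is what makes exact maximization intractable; the price is that the selected set is not guaranteed to be optimal.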

Page 24: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Experiment - Data

A query (a short list of keywords) is run through a query-likelihood language model (Query Likelihood L.M.) to obtain the top 130 retrieved documents: D1, D2, ..., D130.

Page 25: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Experiment - Data

For 60 queries, 2 assessors judged the top 130 retrieved documents (D1, D2, ..., D130):

- 44.7 relevant documents per query
- Each document contains 4.3 facets on average
- 39.2 unique facets on average (roughly one unique facet per relevant document)
- Agreement: 72% of all relevant documents were judged relevant by both assessors

Page 26: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Experiment - Data

Figure: TDT5 sample topic definition, with its query and judgments.

Page 27: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Experiment – Retrieval Engines

Using the Lemur toolkit:

- LM baseline: a query-likelihood language model
- RM baseline: pseudo-feedback with a relevance model
- MMR: query similarity scores from the LM baseline and cosine similarity for novelty
- AvgMix (Prob MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score
- Pruning: removing documents from the LM baseline based on cosine similarity
- FM: the set-based facet model

Page 28: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Experiment – Retrieval Engines

FM: the set-based facet model

- FM-RM: each of the top m documents and their K nearest neighbors becomes a “facet model” P(w|Fj); then compute the probability P(Di|Fj)
- FM-LDA: use LDA to discover subtopics zj and get P(zj|D); we extract 50 subtopics

Page 29: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Experiments - Evaluation

- Use five-fold cross-validation to train and test systems
- The 48 queries in four folds are used to train model parameters
- Those parameters are used to obtain ranked results on the remaining 12 queries
- At the minimum optimal rank for S-recall, we report S-recall, redundancy, and MAP
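The 48/12 split follows directly from dividing 60 queries into five folds. The sketch below shows one way to generate the train/test partitions; the round-robin assignment of queries to folds and the function name are assumptions, since the transcript does not say how the folds were formed.

```python
def five_fold_splits(query_ids, n_folds=5):
    """Yield (train, test) lists, holding out one fold at a time."""
    folds = [query_ids[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [q for i, fold in enumerate(folds) if i != held_out for q in fold]
        yield train, test

queries = list(range(60))  # placeholder ids for the 60 queries
splits = list(five_fold_splits(queries))

for train, test in splits:
    print(len(train), len(test))  # 48 12 for every fold
```

Each query appears in exactly one test fold, so every query is evaluated once with parameters tuned on the other 48.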

Page 30: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Results

Page 31: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Results

Page 32: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Conclusion

We defined a type of novelty retrieval task called faceted topic retrieval: retrieve the facets of an information need in a small set of documents.

We presented two novel models: one that prunes a retrieval ranking and one that is a formally motivated probabilistic model.

Both models are competitive with MMR and outperform another probabilistic model.