Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents
Harr Chen, David R. Karger
MIT CSAIL
ACM SIGIR 2006, August 9, 2006
Outline
• Motivations
• Expected Metric Principle
• Metrics
• Bayesian Retrieval
• Objectives
• Heuristics
• Experimental Results
• Related Work
• Future Work and Conclusions
Motivation
• In IR, we have formal models and formal metrics
• Models provide a framework for retrieval
  – E.g., probabilistic models
• Metrics provide a rigorous evaluation mechanism
  – E.g., precision and recall
• The probability ranking principle (PRP) is provably optimal for precision/recall
  – Rank by probability of relevance
• But other metrics capture other notions of result set quality, and PRP isn’t necessarily optimal for them
Example: Diversity
• A user may be satisfied with one relevant result
  – E.g., navigational queries, question answering
• In this case, we want to “hedge our bets” by retrieving for diversity in the result set
  – Better to satisfy different users with different interpretations than one user many times over
• Reciprocal rank/search length metrics capture this notion
• PRP is suboptimal
IR System Design
• Metrics define a preference ordering on result sets
  – Metric[Result set 1] > Metric[Result set 2] means Result set 1 is preferred to Result set 2
• Traditional approach: try out heuristics that we believe will improve relevance performance
  – Heuristics not directly motivated by the metric
  – E.g., synonym expansion, pseudorelevance feedback
• Observation: Given a model, we can try to directly optimize for some metric
Expected Metric Principle (EMP)
• Knowing which metric to use tells us what to maximize for – the expected value of the metric for each result set, given a model
[Diagram: a three-document corpus yields the candidate result sets (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2); calculate E[Metric] for each result set using the model; return the set with the maximum score]
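For concreteness, a minimal Python sketch of the brute-force selection the diagram describes. The `expected_metric` callback, standing in for E[metric | result set] under some probabilistic model, is a hypothetical placeholder rather than anything specified on the slides.

```python
from itertools import permutations

def emp_select(doc_ids, n, expected_metric):
    """Brute-force EMP: enumerate every ordered result set of size n and
    keep the one whose expected metric value (under the model) is highest."""
    best_set, best_score = None, float("-inf")
    for candidate in permutations(doc_ids, n):
        score = expected_metric(candidate)  # hypothetical E[metric | candidate]
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set

# With the three-document corpus above and n = 2, this scores all six
# candidate result sets and returns the highest-scoring one.
```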
Our Contributions
• Primary: EMP (the metric as the retrieval goal)
  – Metric designed to measure retrieval quality
    • Metrics we consider: precision/recall @ n, search length, reciprocal rank, instance recall, k-call
  – Build a probabilistic model
  – Retrieve to maximize an objective: the expected value of the metric
    • Expectations calculated according to our probabilistic model
  – Use computational heuristics to make the optimization problem tractable
• Secondary: retrieving for diversity (a special case)
  – A natural side effect of optimizing for certain metrics
Detour: What is a Heuristic?
Ad hoc approach
• Use heuristics that are believed to be correlated with good performance
• Heuristics used to improve relevance
• Heuristics (probably) make system slower
• Infinite number of possibilities, no formalism
• Model and heuristics intertwined
Our approach
• Build model that directly optimizes for good performance
• Heuristics used to improve efficiency
• Heuristics (probably) make optimization worse
• Well-known space of optimization techniques
• Clean separation between model and heuristics
Search Length/Reciprocal Rank
• (Mean) search length (MSL): number of irrelevant results until first relevant
• (Mean) reciprocal rank (MRR): one over rank of first relevant
[Example ranked list with the first relevant document at rank 3: search length = 2, reciprocal rank = 1/3]
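A small sketch of both metrics for a single query, assuming relevance judgments are given as a 0/1 list in rank order; the handling of queries with no relevant result is a convention chosen here, not something the slides specify.

```python
def search_length(relevances):
    """Number of irrelevant results seen before the first relevant one."""
    for rank, rel in enumerate(relevances):
        if rel:
            return rank
    return len(relevances)  # convention: no relevant result retrieved

def reciprocal_rank(relevances):
    """One over the rank of the first relevant result (0 if none is relevant)."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# For the example above (first relevant document at rank 3):
# search_length([0, 0, 1]) == 2 and reciprocal_rank([0, 0, 1]) == 1/3
```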
Instance Recall
• Each topic has multiple instances (subtopics, aspects)
• Instance recall is the fraction of instances covered (in union) by the first n results
[Example: instance recall @ 5 = 0.75]
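A sketch of the computation, assuming each result is annotated with the set of instances it covers; the example topic and instance sets are hypothetical.

```python
def instance_recall(instances_per_result, all_instances, n):
    """Fraction of a topic's instances covered, in union, by the first n results."""
    covered = set().union(*instances_per_result[:n])
    return len(covered & set(all_instances)) / len(set(all_instances))

# Hypothetical topic with 4 instances whose top 5 results cover 3 of them:
# instance_recall([{1}, set(), {1, 2}, {3}, set()], {1, 2, 3, 4}, 5) == 0.75
```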
k-call @ n
• Binary metric: 1 if the top n results contain at least k relevant documents, 0 otherwise
• 1-call is (1 − %no)
  – See the TREC robust track
[Example: the top 5 results contain 2 relevant documents, so 1-call @ 5 = 1, 2-call @ 5 = 1, 3-call @ 5 = 0]
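The metric itself is a one-liner; a sketch, again assuming 0/1 relevance judgments in rank order.

```python
def k_call(relevances, k, n):
    """1 if the top n results contain at least k relevant documents, 0 otherwise."""
    return 1 if sum(relevances[:n]) >= k else 0

# A ranking whose top 5 results contain two relevant documents, as in the example:
rels = [1, 0, 0, 1, 0]
assert k_call(rels, 1, 5) == 1 and k_call(rels, 2, 5) == 1 and k_call(rels, 3, 5) == 0
```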
Motivation for k-call
• 1-call: Want one relevant document
  – Many queries are satisfied with a single relevant result
  – Needing only one relevant document leaves more room to explore, which promotes result set diversity
• n-call: Want all n results relevant
  – “Perfect precision”
  – Home in on one interpretation and stick to it!
• Intermediate k
  – Risk/reward tradeoff
• Plus, easily modeled in our framework
  – Binary variable
Bayesian Retrieval Model
• There exist distributions that generate relevant documents and irrelevant documents
• PRP: rank by Pr[r | d]
• Remaining modeling questions: the form of the relevant/irrelevant distributions and the parameters for those distributions
• In this paper, we assume multinomial models and choose parameters by maximum a posteriori (MAP) estimation
  – The prior is the background corpus word distribution
Pr[r | d] is rank-equivalent to the likelihood ratio Pr[d | r] / Pr[d | ¬r]
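A hedged sketch of this modeling choice: a MAP multinomial estimate smoothed toward the background corpus distribution, and a likelihood-ratio document score. The function names, the pseudo-count value, and the term-independence simplification are illustrative assumptions, not the paper's exact estimator.

```python
import math

def map_multinomial(observed_counts, background_probs, prior_strength=100.0):
    """MAP estimate of a multinomial word distribution: observed counts smoothed
    toward the background corpus distribution, which plays the role of the prior.
    `prior_strength` is an illustrative pseudo-count, not a value from the paper."""
    total = sum(observed_counts.values()) + prior_strength
    return {w: (observed_counts.get(w, 0) + prior_strength * p) / total
            for w, p in background_probs.items()}

def likelihood_ratio_score(doc_tokens, rel_probs, irrel_probs):
    """PRP-style score: log Pr[d | r] - log Pr[d | not r], treating terms as
    independent draws from the two multinomial models."""
    return sum(math.log(rel_probs[w]) - math.log(irrel_probs[w])
               for w in doc_tokens if w in rel_probs and w in irrel_probs)
```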
Objective
• Probability Ranking Principle (PRP): maximize Pr[r | d] at each step in the ranking
• Expected Metric Principle (EMP): maximize E[metric | d_1, ..., d_n] for the complete result set
• In particular, for k-call, maximize:
  E[at least k relevant | d_1, ..., d_n] = Pr[at least k relevant | d_1, ..., d_n]
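To make the k-call objective concrete, a sketch of Pr[at least k relevant] computed from per-document relevance probabilities. The independence of relevance events is a simplification made for this sketch only, not a claim about the paper's model.

```python
from itertools import combinations

def prob_at_least_k_relevant(rel_probs, k):
    """Pr[at least k documents relevant], assuming independent relevance events
    with the given marginals (a simplification for this sketch). Exponential in
    the number of documents, so only suitable for small result sets."""
    n = len(rel_probs)
    total = 0.0
    for size in range(k, n + 1):
        for rel_set in combinations(range(n), size):
            p = 1.0
            for i in range(n):
                p *= rel_probs[i] if i in rel_set else 1.0 - rel_probs[i]
            total += p
    return total

# 1-call case: Pr[at least 1 relevant] = 1 - Pr[all irrelevant]
# prob_at_least_k_relevant([0.9, 0.3, 0.2], 1) is approximately 1 - 0.1*0.7*0.8 = 0.944
```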
Optimization of Objective
• Exact optimization of the objective is usually NP-hard
  – E.g., exact optimization for k-call is reducible to the NP-hard maximum graph clique problem
• Approximation heuristic: greedy algorithm
  – Select documents successively in rank order
  – Hold previous documents fixed, optimize the objective at each rank
[Diagram: at rank 1, choose d1 to maximize E[metric | d]; holding d1 fixed, choose d2 to maximize E[metric | d, d1]; holding d1 and d2 fixed, choose d3 to maximize E[metric | d, d1, d2]; and so on]
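A minimal sketch of the greedy loop in the diagram, reusing the hypothetical `expected_metric` callback from the earlier brute-force sketch.

```python
def greedy_emp(doc_ids, n, expected_metric):
    """Greedy approximation to EMP: build the ranking one position at a time,
    holding earlier picks fixed and adding the document that maximizes
    E[metric | picks so far + candidate]."""
    ranking, remaining = [], list(doc_ids)
    for _ in range(min(n, len(remaining))):
        best = max(remaining, key=lambda d: expected_metric(ranking + [d]))
        ranking.append(best)
        remaining.remove(best)
    return ranking
```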
Greedy on 1-call and n-call
• 1-greedy
  – The greedy algorithm reduces to ranking each successive document assuming all previous documents are irrelevant
  – The algorithm has “discovered” incremental negative pseudorelevance feedback
• n-greedy: Assume all previous documents relevant
For the 1-call objective at rank i:

Pr[r_1 ∨ r_2 ∨ ... ∨ r_i]
  = Pr[r_1 ∨ ... ∨ r_{i-1}] + Pr[¬(r_1 ∨ ... ∨ r_{i-1}) ∧ r_i]
  = Pr[r_1 ∨ ... ∨ r_{i-1}] + Pr[¬(r_1 ∨ ... ∨ r_{i-1})] · Pr[r_i | ¬(r_1 ∨ ... ∨ r_{i-1})]

With d_1, ..., d_{i-1} fixed, maximizing this over d_i amounts to maximizing Pr[r_i | ¬r_1, ¬r_2, ..., ¬r_{i-1}]
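Per the derivation above, greedy 1-call at rank i only needs Pr[r_i | ¬r_1, ..., ¬r_{i-1}]. A minimal sketch of that loop, where `prob_relevant(d, assumed_irrelevant)` is a hypothetical model callback (e.g., a likelihood-ratio score with the earlier picks folded into the irrelevant model).

```python
def one_greedy(doc_ids, n, prob_relevant):
    """1-greedy sketch: at each rank, pick the document most likely to be
    relevant given that everything ranked above it is assumed irrelevant."""
    ranking, remaining = [], list(doc_ids)
    for _ in range(min(n, len(remaining))):
        best = max(remaining, key=lambda d: prob_relevant(d, ranking))
        ranking.append(best)
        remaining.remove(best)  # earlier picks are assumed irrelevant next round
    return ranking
```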
Greedy on Other Metrics
• Greedy with precision/recall reduces to PRP!
• Greedy on k-call for general k (k-greedy)
  – More complicated…
• Greedy with MSL, MRR, and instance recall works out to the 1-greedy algorithm
  – Intuition: to make the first relevant document appear earlier, we want to hedge our bets as to the query interpretation (i.e., diversify)
Experiments Overview
• Experiments verify that optimizing for a metric improves performance on that metric
  – They do not tell us which metrics to use
• Looked at ad hoc diversity examples
• TREC topics/queries
• Tuned weights on separate development set
• Tested on:
  – Standard ad hoc (robust track) topics
– Topics with multiple annotators
– Topics with multiple instances
Diversity on Google Results
• Task: reranking top 1,000 Google results
• When optimizing for 1-call, our algorithm finds more diverse results than either PRP or the original Google ranking
Experiments: Robust Track
• TREC 2003, 2004 robust tracks
  – 249 topics
– 528,000 documents
• 1-call and 10-call results are statistically significant
             1-call   10-call   MRR     MSL     P@10
PRP          0.791    0.020     0.563   3.052   0.333
1-greedy     0.835    0.004     0.579   2.763   0.269
10-greedy    0.671    0.084     0.517   3.992   0.337
Experiments: Instance Retrieval
• TREC-6, 7, 8 interactive tracks
  – 20 topics
– 210,000 documents
– 7 to 56 instances per topic
• PRP baseline: instance recall @ 10 = 0.234
• Greedy 1-call: instance recall @ 10 = 0.315
Experiments: Multi-annotator
• TREC-4, TREC-6 ad hoc retrieval
  – Independent annotators assessed the same topics
  – TREC-4: 49 topics, 568,000 documents, 3 annotators
  – TREC-6: 50 topics, 556,000 documents, 2 annotators
• More annotators are satisfied when using 1-greedy
                    1-call (ann. 1)   1-call (ann. 2)   1-call (ann. 3)   Total
TREC-4  PRP         0.735             0.551             0.653             1.939
TREC-4  1-greedy    0.776             0.633             0.714             2.122
TREC-6  PRP         0.660             0.620             N/A               1.280
TREC-6  1-greedy    0.800             0.820             N/A               1.620
Related Work
• Fits in risk minimization framework (objective as negative loss function)
• Other approaches look at optimizing for metrics directly, with training data
• Pseudorelevance feedback
• Subtopic retrieval
• Maximal marginal relevance
• Clustering
• See paper for references
Future Work
• General k-call (k = 2, etc.)
  – Determine whether this is what users actually want
• Better underlying probabilistic model
  – Our contribution is in the ranking objective, not the model; the model can be arbitrarily sophisticated
• Better optimization techniques
  – E.g., local search would differentiate the algorithms for MRR and 1-call
• Other metrics
  – Preliminary work on mean average precision and precision @ recall
    • (Perhaps) surprisingly, these metrics are not optimized by PRP!
Conclusions
• EMP: the metric can motivate the model; choosing and believing in a metric already gives us a reasonable objective, E[metric]
• Can potentially apply EMP on top of a variety of different underlying probabilistic models
• Diversity is one practical example of a natural side effect of using EMP with the right metric
Acknowledgments
• Harr Chen supported by the Office of Naval Research through a National Defense Science and Engineering Graduate Fellowship
• Jaime Teevan, Susan Dumais, and anonymous reviewers provided constructive feedback
• ChengXiang Zhai, William Cohen, and Ellen Voorhees provided code and data