Methods for Evaluating Interactive Information Retrieval Systems
Evaluating Effectiveness of Information Retrieval System
Jing He, Yue Lu. April 21, 2023
Outline
• System Oriented Evaluation
– Cranfield Paradigm
– Measures
– Test Collection
– Incomplete Relevance Judgment Evaluation
• User Oriented Evaluation
Cranfield Paradigm
• Established by [Cleverdon et al. 66]
• Test collection
– Document collection
– Topic set
– Relevance judgment
• Measures
Measure: Binary Relevance
• Binary retrieval
– precision and recall
• Ranked retrieval
– P-R curve: carries the full information; the measures below are summaries of it
– P@N: insensitive, uses only local information, does not average well
– Average Precision
• geometric interpretation
• utility interpretation
– R-precision: the break-even point; approximates the area under the P-R curve
– RR (reciprocal rank): appropriate for known-item search
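As a sketch, the binary-relevance ranking measures above can be computed as follows (function and variable names are illustrative, not from any particular toolkit):

```python
def precision_at(ranking, relevant, n):
    """P@N: fraction of the top-n documents that are relevant."""
    return sum(1 for d in ranking[:n] if d in relevant) / n

def average_precision(ranking, relevant):
    """AP: mean of the precision values at each relevant document's rank."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """RR: 1/rank of the first relevant document (known-item search)."""
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0
```

For the ranking ["d1", "d2", "d3", "d4"] with relevant set {"d2", "d4"}, P@2 = 0.5, AP = (1/2 + 2/4)/2 = 0.5, and RR = 0.5.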
Measure: Graded Relevance
• Discounted Cumulated Gain (DCG)
– the discount function discountFun is typically log_b

DG_i = discountFun(relevanceFun(d_i), i)
DCG = Σ_i DG_i

• Rank-Biased Precision (RBP)
– assumes the user moves from each document to the next with persistence probability p (i.e. stops with probability 1 − p)

RBP = (1 − p) Σ_i r_i · p^(i−1)
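A minimal sketch of both graded-relevance measures, assuming the common log_b discount with no discount applied before rank b (names are my own):

```python
import math

def dcg(gains, b=2):
    """DCG with a log_b discount; ranks below b are not discounted."""
    return sum(g / max(1.0, math.log(i, b))
               for i, g in enumerate(gains, start=1))

def rbp(rels, p=0.8):
    """Rank-biased precision: relevance weighted by the probability
    (1 - p) * p**(i-1) that the user examines exactly i documents."""
    return (1 - p) * sum(r * p ** (i - 1)
                         for i, r in enumerate(rels, start=1))
```

For example, dcg([3, 2]) = 3/1 + 2/log2(2) = 5.0, and rbp([1, 0, 1], p=0.5) = 0.5 · (1 + 0 + 0.25) = 0.625.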
Measure: Topic Set Integration
• Arithmetic average
– MAP, MRR, average P-R curve
– P@n and DCG must be normalized before averaging
• Geometric average
– GMAP: emphasizes difficult topics
• Standardization [Webber et al. SIGIR08]
– standardize per-topic scores as a normal distribution, then average
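The contrast between the arithmetic and geometric averages can be sketched as below; the small epsilon keeps topics with zero AP from collapsing the geometric mean (a common implementation convention, assumed here):

```python
import math

def mean_ap(aps):
    """MAP: arithmetic mean of per-topic AP."""
    return sum(aps) / len(aps)

def gmap(aps, eps=1e-5):
    """GMAP: geometric mean of per-topic AP, computed in log space.
    Low-AP (difficult) topics pull this score down far more than MAP."""
    return math.exp(sum(math.log(a + eps) for a in aps) / len(aps))
```

With per-topic APs [0.9, 0.9, 0.01], MAP ≈ 0.603 while GMAP ≈ 0.20: one difficult topic dominates the geometric average.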
Measure: Compare Systems
• Integrated score difference
– depends on the number of topics, their difficulty, etc.
• Significance tests
– factors:
• null hypothesis: the two systems have identical performance
• test statistic: the integrated score difference
• significance level
– the t-test, randomization test, and bootstrap test agree with each other and are more powerful than the sign and Wilcoxon tests [Smucker et al. CIKM07, Cormack et al. SIGIR07]
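The randomization (permutation) test mentioned above can be sketched for paired per-topic scores; this is a generic two-sided version, not the exact procedure of any cited paper:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test.

    Under the null hypothesis the systems are identical, so each topic's
    score difference can have its sign flipped at random; the p-value is
    the fraction of random sign assignments whose mean difference is at
    least as extreme as the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        s = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(s) / len(diffs) >= observed:
            hits += 1
    return hits / trials
```

Identical score lists give p = 1.0; a large consistent difference across many topics gives a small p.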
Measure: How Good Is a Measure?
• Relationships between measures
– Correlation
• correlation between the system rankings produced by different measures: use Kendall's τ or a variant [Yilmaz et al. SIGIR08]
• all measures are highly correlated, especially AP, R-precision, and nDCG with a fair weight setting [Voorhees TREC99, Kekalainen IPM05]
– Inference ability [Aslam et al. SIGIR05]
• does measure m1's score determine measure m2's score?
• AP and R-precision have inference ability for P@n, but the converse does not hold
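Kendall's τ between two orderings of the same systems can be sketched as follows (a plain implementation for strict rankings without ties; names are my own):

```python
from itertools import combinations

def kendall_tau(rank_by_m1, rank_by_m2):
    """Kendall's tau between two system orderings (lists of system ids,
    best first): +1 for identical rankings, -1 for reversed rankings."""
    pos1 = {s: i for i, s in enumerate(rank_by_m1)}
    pos2 = {s: i for i, s in enumerate(rank_by_m2)}
    concordant = discordant = 0
    for a, b in combinations(rank_by_m1, 2):
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

For example, the rankings ["A", "B", "C"] and ["A", "C", "B"] agree on two of three pairs, giving τ = 1/3.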
Test Collection: Document and Topic
• Document collection
– newspaper, newswire, etc.; Web page sets
– 1 billion pages (25 TB) for the TREC09 Web track
– but not much research on how to construct a document collection for IR evaluation
• Topic set
– human-designed topics or topics drawn from search engine query logs
– how to select discriminative topics?
• [Mizzaro and Robertson SIGIR07] propose a method, but it can only be applied a posteriori
Test Collection: Relevance Judgment
• Judgment agreement
– the agreement rate is low
– system performance rankings are stable between the topic originator and experts on the topic, but not for other assessors [Bailey et al. SIGIR08]; they are also stable between TREC assessors [Voorhees and Harman 05]
Test Collection: Relevance Judgment
• How to select documents to judge
– pooling (introduced by [Jones et al. 76])
– limitations of pooling:
• biased toward contributing systems
• biased toward title words
• not efficient enough
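Depth-k pooling itself is simple to sketch: the judged set is the union of each contributing run's top-k documents, which is exactly why it is biased toward the contributing systems:

```python
def pool(runs, depth=100):
    """Depth-k pooling: judge the union of each run's top-k documents.

    runs: a list of ranked document-id lists, one per contributing system.
    Documents outside the returned pool stay unjudged and are
    traditionally assumed irrelevant.
    """
    judged = set()
    for run in runs:
        judged.update(run[:depth])
    return judged
```

For example, pooling the top 2 of ["a", "b", "c"] and ["b", "d", "e"] yields the judgment set {"a", "b", "d"}.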
Incomplete Relevance Judgment Evaluation
• Motivation
– dynamic, growing collections vs. constant human labor
• Problems
– do traditional measures still work?
– how to select documents to judge?
Incomplete Problem: Measures
• Buckley and Voorhees's bpref [Buckley and Voorhees SIGIR04]
– penalizes each relevant document by the number of judged irrelevant documents ranked above it
• Sakai's condensed measures [Sakai SIGIR07]
– simply remove the unjudged documents from the ranking
• Yilmaz and Aslam's infAP [Yilmaz et al. CIKM06, SIGIR08]
– estimate average precision assuming judgments are a uniform sample
• Results
– infAP, the condensed measures, and nDCG are more robust than bpref when judgments are randomly sampled from the pool
– infAP is more appropriate for estimating the absolute AP value
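A sketch of bpref under the usual formulation 1/R · Σ (1 − |judged non-relevant above r| / min(R, N)); unjudged documents are simply skipped:

```python
def bpref(ranking, relevant, nonrelevant):
    """bpref over a ranking with incomplete judgments.

    relevant / nonrelevant are the judged document sets; any document in
    neither set is unjudged and contributes nothing to the score.
    """
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    m = min(R, N)
    nonrel_above = 0
    score = 0.0
    for d in ranking:
        if d in nonrelevant:
            nonrel_above += 1
        elif d in relevant:
            score += 1 - min(nonrel_above, m) / m if N else 1.0
    return score / R
```

For the ranking ["r1", "n1", "r2"] with judged relevant {"r1", "r2"} and judged non-relevant {"n1", "n2"}: r1 has no non-relevant documents above it (contribution 1) and r2 has one (contribution 1 − 1/2), so bpref = 1.5/2 = 0.75.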
Incomplete Problem: Selecting Documents to Judge (1)
• Aslam's statAP [Aslam et al. SIGIR06, Allan et al. TREC07, Yilmaz et al. SIGIR08]
– an extension of infAP (which is based on uniform sampling)
– uniform sampling yields too few relevant documents
– stratified sampling:
• higher sampling probability for documents ranked high by more retrieval systems (like voting)
– estimate AP from the weighted sample
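The estimation idea behind non-uniform sampling can be sketched with a Horvitz-Thompson estimator; this is a deliberate simplification (estimating only the number of relevant documents, with illustrative names), not the full statAP procedure, which stratifies and estimates AP itself:

```python
import random

def sample_and_estimate(docs, incl_prob, judge, seed=0):
    """Sample each document with its own inclusion probability, then
    weight each sampled document's judgment by 1/p.  The result is an
    unbiased estimate of the total number of relevant documents, even
    though only the sample gets judged."""
    rng = random.Random(seed)
    sample = [d for d in docs if rng.random() < incl_prob[d]]
    return sum(judge(d) / incl_prob[d] for d in sample)
```

With all inclusion probabilities set to 1.0 the estimate is exact; lowering the probabilities for unpromising documents trades judging effort for variance.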
Incomplete Problem: Selecting Documents to Judge (2)
• Carterette's minimal test collection
– select the most "discriminative" document to judge next
– how to define "discriminative"?
• by how the bounds on the AP difference change given knowledge of this document's relevance
– estimate AP from the judged documents
Incomplete Problem
• It’s more reliable to handle incomplete problem with more queries with less judgment each
• statAP is more appropriate for estimating absolute AP value
• Minimal test collection is more appropriate for discriminating systems
User Oriented Evaluation
• Alternative to batch-mode evaluation: conduct user studies [Kagolovsky&Moehr 03]
o actual users use the system and assess the quality of the search process and results
• Advantages:
o shows the actual utility of the system and provides more interpretability in terms of its usefulness
• Deficiencies:
o difficult to compare two systems reliably in the same context
o expensive to invite many users to participate in the experiments
Criticism of batch-mode evaluation [Kagolovsky&Moehr 03][Harter&Hert ARIST97]
• Expensive judgments
o obtaining relevance judgments is time consuming
o how to overcome this? predict relevance from implicit information, which is easy to collect with real systems
• Judgment = user need?
o judgments may not represent real users' information needs, so the evaluation results may not reflect the real utility of the system
o does batch evaluation correlate well with user evaluation?
Expensive judgments (1)
• [Carterette&Jones 07NIPS]
o predict the relevance score (nDCG) from clicks after an initial training phase
o can identify the better of two rankings 82% of the time with no relevance judgments, and 94% of the time with only two judgments per query
• [Joachims 03TextMining]
o compare two systems using click-through data on a mixed ranking list generated by interleaving the results of the two systems
o the results closely followed the relevance judgments using P@n
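One way to build such a mixed list is team-draft interleaving; this is a sketch of that general scheme (with illustrative names), not necessarily the exact variant used in the cited study:

```python
import random

def team_draft_interleave(run_a, run_b, k=10, seed=0):
    """Team-draft interleaving: in each round the two systems, in a
    random order, each contribute their best not-yet-shown document.
    Clicks are later credited to the team that contributed the clicked
    document, giving a paired comparison of the two systems."""
    rng = random.Random(seed)
    merged, team = [], {}
    while len(merged) < k:
        added = False
        order = [("A", run_a), ("B", run_b)]
        if rng.random() < 0.5:
            order.reverse()
        for name, run in order:
            doc = next((d for d in run if d not in team), None)
            if doc is not None and len(merged) < k:
                team[doc] = name
                merged.append(doc)
                added = True
        if not added:  # both runs exhausted
            break
    return merged, team
```

Because the draft order is randomized per round, neither system gets a systematic position advantage, which is what makes the click comparison fair.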
Expensive judgments (2)
• [Radlinski et al 08CIKM]
o "absolute usage metrics" (such as clicks per query or the frequency of query reformulations) fail to reflect retrieval quality
o "paired comparison tests" produce reliable predictions
• Summary
o reliable pair-wise comparisons are available
o reliable absolute prediction of relevance scores is still an open research question
Judgment = user need? (1)
• Negative correlation
o [Hersh et al 00SIGIR] 24 users on 6 instance-recall tasks
o [Turpin&Hersh 01SIGIR] 24 users on 6 QA tasks
o both found no significant difference in user task effectiveness between systems with significantly different MAP
o the small number of topics may explain why no correlation was detected
• Mixed correlation
o [Turpin&Scholer 06SIGIR] two experiments on 50 queries:
o one precision-based user task (finding the first relevant document)
o one recall-based user task (number of relevant documents found in five minutes)
o results: no significant relationship between system effectiveness and user effectiveness on the precision task, and a significant but weak relationship on the recall-based task
Judgment = user need? (2)
• Positive correlation
o [Allan et al 05SIGIR] 33 users, 45 topics: differences in bpref (0.5-0.98) could produce significant differences in user effectiveness at retrieving faceted document passages
o [Huffman&Hochster 07SIGIR] 7 participants, 200 Google queries: assessor satisfaction correlates fairly strongly with the relevance of the top three documents, measured using a version of nDCG
o [Al-Maskari et al 08SIGIR] 56 users, recall-based tasks on 56 queries over "good" and "bad" systems: the authors showed that user effectiveness (time consumed, relevant documents collected, queries issued, satisfaction, etc.) and system effectiveness (P@n, MAP) are highly correlated
Judgment = user need? (3)
• Summary
o although batch-mode relevance evaluation has limitations, most recent studies show a high correlation between user evaluation and system evaluation using relevance measures