Evaluation in Information Retrieval


Description

The evaluation methods used in the IR field.

Transcript of "Evaluation in Information Retrieval" (80 slides)

Page 1: Evaluation in Information Retrieval

Evaluation in Information Retrieval

Ruihua Song
Web Search and Mining Group
Email: [email protected]

Page 2: Evaluation in Information Retrieval

Overview

• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper

Page 3: Evaluation in Information Retrieval

How to evaluate?

• How well does the system meet the information need?
   System evaluation: how good are the document rankings?
   User-based evaluation: how satisfied is the user?

Page 4: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 5: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 6: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 7: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 8: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 9: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 10: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 11: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 12: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 13: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 14: Evaluation in Information Retrieval

SIGIR'05 Keynote given by Amit Singhal from Google

Evaluation Challenges on the Web

• Collection is dynamic
   10-20% of URLs change every month
• Queries are time sensitive
   Topics are hot, then they are not
• Spam methods evolve
   Algorithms evaluated against last month's web may not work today
• But we have a lot of users... you can use clicks as supervision

Page 15: Evaluation in Information Retrieval

Overview

• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper

Page 16: Evaluation in Information Retrieval

Ellen Voorhees, The TREC Conference: An Introduction

Page 17: Evaluation in Information Retrieval

P-R curve

• Precision and recall
• Precision-recall curve
• Average precision-recall curve

Page 18: Evaluation in Information Retrieval

P-R curve (cont.)

• For a query there is a result list (answer set)

[Venn diagram: R is the set of relevant documents, A is the answer set, and Ra is their intersection, i.e. the relevant documents that were retrieved]

Page 19: Evaluation in Information Retrieval

P-R curve (cont.)

• Recall is the fraction of the relevant documents that have been retrieved
• Precision is the fraction of the retrieved documents that are relevant

  recall = |Ra| / |R|

  precision = |Ra| / |A|

Page 20: Evaluation in Information Retrieval

P-R curve (cont.)

• E.g. for some query, |Total Docs| = 200 and |R| = 20 (r: relevant, n: non-relevant). At rank 10 of the following ranked list, recall = 6/20 and precision = 6/10:

  d123(r), d84(n), d5(n), d87(r), d80(r), d59(n), d90(r), d8(n), d89(r), d55(r), ...
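A minimal Python sketch (not from the slides) of computing precision and recall at a cutoff for this example, where the 0/1 labels encode the r/n judgments of the top-10 results and total_relevant is |R|:

```python
def precision_recall_at_k(labels, total_relevant, k):
    """labels: 0/1 relevance labels of the ranked results, in rank order."""
    retrieved_relevant = sum(labels[:k])
    return retrieved_relevant / k, retrieved_relevant / total_relevant

# Top-10 labels of the example ranking above (r=1, n=0).
labels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
print(precision_recall_at_k(labels, total_relevant=20, k=10))
# -> (0.6, 0.3), i.e. precision = 6/10 and recall = 6/20
```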

Page 21: Evaluation in Information Retrieval

Individual query P-R curve

Page 22: Evaluation in Information Retrieval

P-R curve (cont.)

Page 23: Evaluation in Information Retrieval

MAP

• Mean Average Precision
• Defined as the mean of the precision values obtained after each relevant document is retrieved, using zero as the precision for relevant documents that are not retrieved

Page 24: Evaluation in Information Retrieval

MAP (cont.)

• E.g. |Total Docs| = 200, |R| = 20, and the whole result list consists of the following 10 docs (r: relevant, n: non-relevant):

  d123(r), d84(n), d5(n), d87(r), d80(r), d59(n), d90(r), d8(n), d89(r), d55(r)

  MAP = (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 6
  (averaged here over the 6 retrieved relevant documents; under the zero-precision convention in the definition above, the denominator would be |R| = 20)
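A minimal sketch of average precision for a single query, not from the slides; by default it averages over the relevant documents actually retrieved, as in the example above, and passing total_relevant=|R| instead applies the zero-precision convention from the definition:

```python
def average_precision(labels, total_relevant=None):
    """labels: 0/1 relevance labels of the ranked results, in rank order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision after each relevant doc
    denominator = total_relevant if total_relevant else len(precisions)
    return sum(precisions) / denominator if denominator else 0.0

labels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
print(average_precision(labels))      # (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 6
print(average_precision(labels, 20))  # same numerator divided by |R| = 20
# MAP is the mean of average_precision over all queries/topics.
```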

Page 25: Evaluation in Information Retrieval

Precision at 10

• P@10 is the fraction of the top 10 documents in the ranked list returned for a topic that are relevant
• E.g. if 3 of the top 10 documents are relevant, P@10 = 3/10 = 0.3
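A minimal sketch of P@k, using hypothetical labels that match the example above:

```python
def precision_at_k(labels, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(labels[:k]) / k

# 3 relevant documents in the top 10 -> P@10 = 0.3
print(precision_at_k([1, 0, 0, 1, 0, 0, 1, 0, 0, 0], k=10))
```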

Page 26: Evaluation in Information Retrieval

Mean Reciprocal Rank

• The reciprocal rank for a topic is the reciprocal of the rank of the first relevant document in the ranked list; MRR is the mean of this value over topics
• E.g. if the first relevant document is ranked at position 4, the reciprocal rank is 1/4 = 0.25
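A minimal sketch of the reciprocal rank per topic and MRR over topics; the rankings are hypothetical:

```python
def reciprocal_rank(labels):
    """1/rank of the first relevant document; 0 if none is retrieved."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

rankings = [[0, 0, 0, 1, 0], [1, 0, 0]]     # first relevant at rank 4 and rank 1
print(reciprocal_rank(rankings[0]))         # 0.25, as in the example above
print(sum(reciprocal_rank(r) for r in rankings) / len(rankings))  # MRR = 0.625
```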

Page 27: Evaluation in Information Retrieval

bpref

• Bpref stands for Binary Preference
• It considers only judged documents in the result list
• The basic idea is to count how many times judged non-relevant documents are retrieved before judged relevant documents

Page 28: Evaluation in Information Retrieval

bpref (cont.)

Page 29: Evaluation in Information Retrieval

bpref (cont.)

• E.g. |Total Docs| = 200, |R| = 20 (r: judged relevant, n: judged non-relevant, u: not judged, i.e. unknown whether relevant or not):

  d123(r), d84(n), d5(n), d87(u), d80(r), d59(n), d90(r), d8(u), d89(u), d55(r), ...
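A minimal sketch of bpref under one commonly used formulation (roughly the trec_eval variant of the Buckley & Voorhees 2004 measure): bpref = (1/R) * sum over retrieved judged-relevant documents of (1 - min(#judged-non-relevant ranked above, R) / min(R, N)), where R and N are the numbers of judged relevant and judged non-relevant documents. The value of N below is an assumption for illustration; unjudged documents are simply ignored, which is the point of the measure:

```python
def bpref(labels, R, N):
    """labels: 1 = judged relevant, 0 = judged non-relevant, None = unjudged."""
    nonrel_above, total = 0, 0.0
    for rel in labels:
        if rel is None:                 # unjudged: ignore
            continue
        if rel == 0:                    # judged non-relevant
            nonrel_above += 1
        else:                           # judged relevant
            total += 1.0 - min(nonrel_above, R) / min(R, N)
    return total / R

# Slide example (r=1, n=0, u=None), with R=20; N=30 is assumed for illustration.
labels = [1, 0, 0, None, 1, 0, 1, None, None, 1]
print(bpref(labels, R=20, N=30))
```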

Page 30: Evaluation in Information Retrieval

References

• Baeza-Yates, R. & Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999, 73-96.
• Buckley, C. & Voorhees, E. M. Retrieval Evaluation with Incomplete Information. Proceedings of SIGIR 2004.

Page 31: Evaluation in Information Retrieval

NDCG

• Two assumptions about the ranked result list
   Highly relevant documents are more valuable
   The greater the ranked position of a relevant document, the less valuable it is for the user

Page 32: Evaluation in Information Retrieval

NDCG (cont.)

• Graded judgment -> gain vector
• Cumulated Gain

Page 33: Evaluation in Information Retrieval

NDCG (cont.)

• Discounted CG
• Discounting function

Page 34: Evaluation in Information Retrieval

NDCG (cont.)

• Ideal (D)CG vector

Page 35: Evaluation in Information Retrieval

NDCG (cont.)

Page 36: Evaluation in Information Retrieval

NDCG (cont.)

• Normalized (D)CG (see the sketch below)
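A minimal sketch of CG/DCG/NDCG in the spirit of Jarvelin & Kekalainen, assuming a log2 discount (the base of the discounting function is a parameter in the original paper) and graded gains such as 0-3; the gain vector below is made up:

```python
import math

def dcg(gains):
    """No discount at rank 1, then divide the gain by log2(rank) for ranks >= 2."""
    return sum(g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1))

def ndcg(gains, ideal_gains=None):
    """Normalize DCG by the DCG of the ideal (descending) gain vector.
    In practice the ideal vector is built from all judged documents for the topic;
    here it defaults to the same gains sorted in decreasing order."""
    ideal = sorted(ideal_gains if ideal_gains is not None else gains, reverse=True)
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))   # hypothetical graded judgments for one ranked list
```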

Page 37: Evaluation in Information Retrieval

NDCG (cont.)

Page 38: Evaluation in Information Retrieval

NDCG (cont.)

• Pros
   Graded, more precise than R-P
   Reflects more user behavior (e.g. user persistence)
   CG and DCG graphs are intuitive to interpret
• Cons
   Disagreements in rating
   How to set the parameters

Page 39: Evaluation in Information Retrieval

Reference

• Jarvelin, K. & Kekalainen, J. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, 2002, 20, 422-446.

Page 40: Evaluation in Information Retrieval

Overview

• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper

Page 41: Evaluation in Information Retrieval

Significance Test

• Significance Test
   Why is it necessary?
   The t-test is chosen in IR experiments
   • Paired
   • Two-tailed / one-tailed

Page 42: Evaluation in Information Retrieval

Is the difference significant?

• Two almost identical systems

[Figure: score distributions p(score) for the two systems]

Green < Yellow?

Is the difference significant, or just caused by chance?

Page 43: Evaluation in Information Retrieval

Medical Theory, Chapter 7; excerpted from www.37c.com.cn

T-Test

• Comparing a sample mean with the population mean

To judge whether an observed set of measurements is close to its population mean: is the difference between them merely the sampling error between a sample and the population it was drawn from, or does it exceed the allowable range of sampling error and indicate a significant difference?

• Comparing sample means for paired data

Sometimes we do not know the population mean, and the data come in matched pairs. We can first inspect the difference within each pair, then compute the mean difference as the sample mean, and compare it with the hypothesized population mean to see whether the difference is significant.
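A minimal sketch (assuming SciPy is available) of how the paired, two-tailed t-test is typically applied in IR experiments: the two systems are compared on the same topics, pairing their per-topic scores (e.g. average precision). The score values below are made up for illustration:

```python
from scipy.stats import ttest_rel

# Per-topic scores (e.g. average precision) of two systems on the same 8 topics.
system_a = [0.42, 0.55, 0.31, 0.67, 0.48, 0.52, 0.40, 0.61]
system_b = [0.45, 0.58, 0.30, 0.70, 0.52, 0.55, 0.44, 0.66]

t_stat, p_value = ttest_rel(system_a, system_b)   # paired, two-tailed by default
print(t_stat, p_value)
# If p_value < 0.05, the observed difference between the two systems is unlikely
# to be explained by chance alone at the 5% significance level.
```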

Page 44: Evaluation in Information Retrieval

Medical Theory, Chapter 7; excerpted from www.37c.com.cn

T-Test (cont.)

Page 45: Evaluation in Information Retrieval

Medical Theory, Chapter 7; excerpted from www.37c.com.cn

T-Test (cont.)

Page 46: Evaluation in Information Retrieval

Medical Theory, Chapter 7; excerpted from www.37c.com.cn

T-Test (cont.)

Page 47: Evaluation in Information Retrieval

Medical Theory, Chapter 7; excerpted from www.37c.com.cn

T-Test (cont.)

Page 48: Evaluation in Information Retrieval

Medical Theory, Chapter 7; excerpted from www.37c.com.cn

T-Test (cont.)

Page 49: Evaluation in Information Retrieval

Medical Theory, Chapter 7; excerpted from www.37c.com.cn

T-Test (cont.)

Page 50: Evaluation in Information Retrieval

Overview

• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper

T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay. Accurately Interpreting Clickthrough Data as Implicit Feedback. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.

Page 51: Evaluation in Information Retrieval

First Author

Page 52: Evaluation in Information Retrieval

Introduction

• The user study differs in at least two respects from previous work
   It provides detailed insight into the users' decision-making process through the use of eyetracking
   It evaluates relative preference signals derived from user behavior
• Clicking decisions are biased in at least two ways: trust bias and quality bias
• Clicks have to be interpreted relative to the order of presentation and relative to the other abstracts

Page 53: Evaluation in Information Retrieval

User Study

• The studies were designed not only to record and evaluate user actions, but also to give insight into the decision process that led the user to the action
• This is achieved by recording users' eye movements with an eye tracker

Page 54: Evaluation in Information Retrieval

Questions Used

Page 55: Evaluation in Information Retrieval

Two Phases of the Study

• Phase I
   34 participants
   Start the search with a Google query, then search for answers
• Phase II
   Investigates how users react to manipulations of the search results
   Same instructions as Phase I
   Each subject assigned to one of three experimental conditions
   • Normal
   • Swapped
   • Reversed

Page 56: Evaluation in Information Retrieval

Explicit Relevance Judgments

• Collected explicit relevance judgments for all queries and results pages
   Phase I
   • Randomized the order of abstracts and asked judges to (weakly) order the abstracts
   Phase II
   • The set for judging includes more: abstracts and Web pages
• Inter-judge agreement
   Phase I: 89.5%
   Phase II: abstracts 82.5%, pages 86.4%

Page 57: Evaluation in Information Retrieval

Eyetracking

• Fixations: 200-300 milliseconds; used in this paper
• Saccades: 40-50 milliseconds
• Pupil dilation

Page 58: Evaluation in Information Retrieval

Analysis of User Behavior

• Which links do users view and click?
• Do users scan links from top to bottom?
• Which links do users evaluate before clicking?

Page 59: Evaluation in Information Retrieval

Which links do users view and click?

• Almost equal viewing frequency for the 1st and 2nd links, but more clicks on the 1st link
• Once the user has started scrolling, rank appears to become less of an influence

Page 60: Evaluation in Information Retrieval

Do users scan links from top to bottom?

• Big gap before viewing the 3rd-ranked abstract
• Users scan the viewable results thoroughly before scrolling

Page 61: Evaluation in Information Retrieval

Which links do users evaluate before clicking?

• Abstracts closer above the clicked link are more likely to be viewed
• The abstract right below a clicked link is viewed roughly 50% of the time

Page 62: Evaluation in Information Retrieval

Analysis of Implicit Feedback

• Does relevance influence user decisions?
• Are clicks absolute relevance judgments?
• Are clicks relative relevance judgments?

Page 63: Evaluation in Information Retrieval

Does relevance influence user decisions?

• Yes
• Use the "reversed" condition
   Controllably decreases the quality of the retrieval function and the relevance of highly ranked abstracts
• Users react in two ways
   They view lower-ranked links more frequently and scan significantly more abstracts
   Subjects are much less likely to click on the first link and more likely to click on a lower-ranked link

Page 64: Evaluation in Information Retrieval

Are clicks absolute relevance judgments?

• Interpretation is problematic
• Trust bias
   The abstract ranked first receives more clicks than the second, either because
   • the first link is more relevant (not influenced by the order of presentation), or
   • users prefer the first link due to some level of trust in the search engine (influenced by the order of presentation)

Page 65: Evaluation in Information Retrieval

Trust Bias

• The hypothesis that users are not influenced by presentation order can be rejected
• Users have substantial trust in the search engine's ability to estimate relevance

Page 66: Evaluation in Information Retrieval

Quality Bias

• The quality of the ranking influences the user's clicking behavior
   If the relevance of the retrieved results decreases, users click on abstracts that are on average less relevant
   Confirmed by the "reversed" condition

Page 67: Evaluation in Information Retrieval

Are clicks relative relevance judgments?

• An accurate interpretation of clicks needs to take two biases into consideration, but they are difficult to measure explicitly
   The user's trust in the quality of the search engine
   The quality of the retrieval function itself
• How about interpreting clicks as pairwise preference statements?
• An example (see the sketch below)
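A minimal sketch of extracting pairwise preferences of the "Click > Skip Above" kind discussed in the paper: a clicked result is preferred over every non-clicked result ranked above it. The ranking and the set of clicked links below are hypothetical:

```python
def click_gt_skip_above(ranking, clicked):
    """Return (preferred, less_preferred) pairs implied by the clicks."""
    prefs = []
    for i, link in enumerate(ranking):
        if link in clicked:
            # The clicked link is preferred over every skipped link ranked above it.
            prefs += [(link, skipped) for skipped in ranking[:i] if skipped not in clicked]
    return prefs

ranking = ["l1", "l2", "l3", "l4", "l5"]
clicked = {"l3", "l5"}
print(click_gt_skip_above(ranking, clicked))
# -> [('l3', 'l1'), ('l3', 'l2'), ('l5', 'l1'), ('l5', 'l2'), ('l5', 'l4')]
```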

Page 68: Evaluation in Information Retrieval

Comments:
• Takes trust and quality bias into consideration
• Substantially and significantly better than random
• Close in accuracy to inter-judge agreement

In the example,

Page 69: Evaluation in Information Retrieval

Experimental Results

Page 70: Evaluation in Information Retrieval

Comments:
• Slightly more accurate than Strategy 1
• Not a significant difference in Phase II

In the example,

Page 71: Evaluation in Information Retrieval

Experimental Results

Page 72: Evaluation in Information Retrieval

Comments:
• Accuracy worse than Strategy 1
• Ranking quality has an effect on the accuracy

In the example,

Page 73: Evaluation in Information Retrieval

Experimental Results

Page 74: Evaluation in Information Retrieval

Comments:
• No significant differences compared to Strategy 1

In the example,
Rel(l5) > Rel(l4)

Page 75: Evaluation in Information Retrieval

Experimental Results

Page 76: Evaluation in Information Retrieval

Comments:
• Highly accurate in the "normal" condition
• Misleading
   Aligned preferences are probably less valuable for learning
   Better results even if the user behaves randomly
• Less accurate than Strategy 1 in the "reversed" condition

In the example,
Rel(l1) > Rel(l2), Rel(l3) > Rel(l4), Rel(l5) > Rel(l6)

Page 77: Evaluation in Information Retrieval

Experimental Results

Page 78: Evaluation in Information Retrieval

Conclusion

• Users' clicking decisions are influenced by trust bias and quality bias, so it is difficult to interpret clicks as absolute feedback
• Strategies for generating relative relevance feedback signals are proposed and shown to correspond well with explicit judgments
• While the implicit relevance signals are less consistent with the explicit judgments than the explicit judgments are with each other, the difference is encouragingly small

Page 79: Evaluation in Information Retrieval

Summary

• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper

T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay. Accurately Interpreting Clickthrough Data as Implicit Feedback. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.

Page 80: Evaluation in Information Retrieval

Thanks!