Advances on the Development of Evaluation Measures


Ben Carterette Evangelos Kanoulas Emine Yilmaz

Information Retrieval Systems

Match information seekers with the information they seek

“What you can’t measure you can’t improve”

Lord Kelvin

Why is Evaluation so Important?

• Most retrieval systems are tuned to optimize for an objective evaluation metric

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Online Evaluation

• Design interactive experiments

• Use users’ actions to evaluate the quality


Online Evaluation

• Standard click metrics
– Clickthrough rate
– Queries per user
– Probability a user skips over results they have considered (pSkip)

• Result interleaving

What is Result Interleaving?

• A way to compare rankers online
– Given the two rankings produced by two methods
– Present a combination of the rankings to users
– Credit assignment based on clicks

Team Draft Interleaving (Radlinski et al., 2008)

• Interleaving two rankings
– Input: two rankings
– Repeat:
• Toss a coin to see which team picks next
• Winner picks their best remaining player
• Loser picks their best remaining player
– Output: one ranking

• Credit assignment
– The ranking providing more of the clicked results wins

Team Draft Interleaving: Example

Ranking A
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries - Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented Ranking (interleaving picks from team A and team B)
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
6. Napa Valley College (www.napavalley.edu/homex.asp)
7. NapaValley.org (www.napavalley.org)

Each clicked result is credited to the ranking that contributed it; in this example the clicks fall on results contributed by B, so B wins!
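Below is a minimal Python sketch of the team-draft construction and click-based credit assignment described above. It is illustrative only: the document identifiers, the single click, and the coin toss via random.shuffle are my own assumptions, not from the slides.

```python
import random

def next_unused(ranking, used):
    """Return the highest-ranked result from `ranking` not yet presented."""
    for doc in ranking:
        if doc not in used:
            return doc
    return None

def team_draft_interleave(ranking_a, ranking_b):
    """Build the presented ranking and remember which team contributed each result."""
    presented, team_of, used = [], {}, set()
    while True:
        teams = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(teams)            # toss a coin: which team picks first this round
        progressed = False
        for team, ranking in teams:      # each team picks its best remaining "player"
            doc = next_unused(ranking, used)
            if doc is not None:
                presented.append(doc)
                team_of[doc] = team
                used.add(doc)
                progressed = True
        if not progressed:               # both rankings exhausted
            break
    return presented, team_of

def winner(team_of, clicked):
    """Credit assignment: the ranking providing more of the clicked results wins."""
    a = sum(1 for d in clicked if team_of.get(d) == "A")
    b = sum(1 for d in clicked if team_of.get(d) == "B")
    return "A" if a > b else "B" if b > a else "tie"

# Hypothetical mini-example (URLs stand in for documents)
ranking_a = ["napavalley.com", "napavalley.com/wineries", "napavalley.edu"]
ranking_b = ["en.wikipedia.org/wiki/Napa_Valley", "napavalley.com", "napalinks.com"]
presented, team_of = team_draft_interleave(ranking_a, ranking_b)
print(winner(team_of, clicked=["en.wikipedia.org/wiki/Napa_Valley"]))  # "B"
```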

Offline Evaluation

• Controlled laboratory experiments

• The user’s interaction with the engine is only simulated:
– Ask experts to judge each query result
– Predict how users behave when they search
– Aggregate judgments to evaluate


Online vs. Offline Evaluation

• Online
– Pros: cheap; measures actual user reactions
– Cons: need to go live; noisy; slow; not duplicable

• Offline
– Pros: fast to evaluate; easy to try new ideas; portable
– Cons: needs ground truth; judgments are slow to obtain (“expensive”) and can be “inconsistent”; difficult to model how users behave

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Traditional Experiment

[Diagram: search engines return ranked results, which human judges assess. The question the judgments answer: how many good docs have I missed/found?]

Depth-k Pooling

• Each participating system (sys1, sys2, …, sysM) submits a ranked list of documents
• The union of the top-k documents from all submitted runs forms the judging pool
• Assessors judge each pooled document as relevant (R) or non-relevant (N)
• Documents below depth k that never entered the pool remain unjudged (“?”) and are typically treated as non-relevant

[Figure: ranked lists from sys1–sysM with documents A, B, C, …; the top-k of each list is pooled and judged, and the R/N judgments are mapped back onto every run.]

Reusable Test Collections

• Document Corpus

• Topics

• Relevance Judgments


Evaluation Metrics: Precision vs Recall

Retrieved list (ranks 1–10): R N R N N R N N N R …

Visualizing Retrieval Performance: Precision-Recall Curves

List (ranks 1–10): R N R N N R N N N R

Evaluation Metrics: Average Precision

List (ranks 1–10): R N R N N R N N N R

• Average precision (AP): the average of the precision values at the ranks of the relevant documents, divided by the total number of relevant documents for the topic
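As a small illustration (mine, not from the slides), here is how precision@k and average precision can be computed for the example list; the total number of relevant documents is passed in explicitly, since it is not given on the slide.

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (rels is a 0/1 list by rank)."""
    return sum(rels[:k]) / k

def average_precision(rels, total_relevant):
    """Average of precision@r at the ranks r of relevant documents,
    normalised by the total number of relevant documents for the topic."""
    ap = sum(precision_at_k(rels, r) for r, rel in enumerate(rels, start=1) if rel)
    return ap / total_relevant

# Example list from the slide: R N R N N R N N N R
rels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
print(precision_at_k(rels, 10))                    # 0.4
print(average_precision(rels, total_relevant=4))   # assuming only these 4 are relevant
```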

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


User models Behind Traditional Metrics

• Precision@k

– Users always look at top k documents

– What fraction of the top k documents are relevant?

• Recall

– Users would like to find all the relevant documents.

– What fraction of these documents have been retrieved by the search engine?

User Model of Average Precision (Robertson ‘08)

1. User steps down a ranked list one-by-one

2. Stops browsing documents due to satisfaction

– stops with a certain probability after observing a relevant document

3. Gains utility from each relevant document

User Model of Average Precision (Robertson ‘08)

• The probability that the user stops browsing is uniform over all the relevant documents:

  $P(n) = 1/R$ if the document at rank $n$ is relevant, $0$ otherwise (where $R$ is the number of relevant documents)

• The utility the user gains when stopping at a relevant document at rank $n$ is the precision at rank $n$:

  $U(n) = \frac{1}{n}\sum_{k=1}^{n} rel(k)$

• AP can be written as:

  $AP = \sum_{n} P(n)\,U(n)$

User Model Based Evaluation Measures

• Directly aim at evaluating user satisfaction

– An effectiveness measure should be correlated with the user’s experience

• Hence the interest in effectiveness measures based on explicit models of user interaction:
– Devise a user model correlated with user behavior
– Infer an evaluation metric from the user model

Basic User Model

• Simple model of user interaction:

1. User steps down ranked results one-by-one

2. Stops at a document at rank k with some probability P(k)

3. Gains some utility U(k) from relevant documents

$M = \sum_{k} U(k)\,P(k)$

Basic User Model

1. Discount: What is the chance a user will visit a document?

– Model of the browsing behavior

2. Utility: What does the user gain by visiting a document?

Model Browsing Behavior

Position-based models

The chance of observing a document depends on the position at which it is presented in the ranked list.

[Figure: a ranked result list (positions 1–10) for the query "black powder ammunition".]

Rank Biased Precision

[Figure: the RBP browsing model over the result list for "black powder ammunition": after issuing the query, at each position the user either stops or views the next item.]

$RBP = (1-\theta)\sum_{i=1}^{\infty} \theta^{\,i-1}\, rel_i$
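A one-line sketch of the RBP formula above; the persistence value θ = 0.8 is only an illustrative choice.

```python
def rbp(rels, theta=0.8):
    """Rank-biased precision: (1 - theta) * sum_i theta^(i-1) * rel_i,
    where theta is the user's persistence ("patience") parameter."""
    return (1 - theta) * sum(theta ** i * rel for i, rel in enumerate(rels))

print(rbp([1, 0, 1, 0, 0, 1, 0, 0, 0, 1], theta=0.8))  # example list from earlier
```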

Discounted Cumulative Gain

Example ranked list for "black powder ammunition" with graded relevance (HR = highly relevant, R = relevant, N = non-relevant), gain $2^{rel}-1$, and discount by rank $1/\log_2(r+1)$:

Rank  Relevance  Grade  Gain (2^rel − 1)  Discounted gain
1     HR         2      3                 3
2     R          1      1                 0.63
3     N          0      0                 0
4     N          0      0                 0
5     HR         2      3                 1.14
6     R          1      1                 0.35
7     N          0      0                 0
8     R          1      1                 0.31
9     N          0      0                 0
10    N          0      0                 0

DCG ≈ 5.46

Discounted Cumulative Gain

• DCG can be written as:

  $DCG = \sum_{r=1}^{N} P(\text{user visits doc at rank } r)\cdot Utility(r)$

• The discount function models the probability that the user visits (clicks on) the document at rank $r$
– Currently, $P(\text{user clicks on doc } r) = 1/\log_2(r+1)$

Discounted Cumulative Gain

• Instead of stopping probability, think about viewing probability

• This fits into the discounted-gain model framework

Normalised Discounted Cumulative Gain

• Compute DCG for the ranked list as in the example above (gain $2^{rel}-1$, discount $1/\log_2(r+1)$)
• Normalise by the DCG of the optimal (ideal) ordering of the judged documents:

  $NDCG = \frac{DCG}{optDCG}$
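A short sketch of DCG and NDCG with the gain and discount used on these slides; the grades are the ones from the worked example.

```python
import math

def dcg(grades):
    """DCG with gain 2^rel - 1 and discount 1/log2(r + 1)."""
    return sum((2 ** g - 1) / math.log2(r + 1) for r, g in enumerate(grades, start=1))

def ndcg(grades):
    """Normalise by the DCG of the ideal (sorted) ordering of the same grades."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

grades = [2, 1, 0, 0, 2, 1, 0, 1, 0, 0]   # HR R N N HR R N R N N
print(round(dcg(grades), 2))               # ~5.46
print(round(ndcg(grades), 2))
```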

Model Browsing Behavior

Cascade-based models

[Figure: ranked result list (positions 1–10) for the query "black powder ammunition".]

• The user views search results from top to bottom
• At each rank i, the user has a certain probability of being satisfied
– The probability of satisfaction is proportional to the relevance grade of the document at rank i
• Once the user is satisfied with a document, they terminate the search

Rank Biased Precision

[Figure: the RBP browsing model revisited: after each viewed item the user either stops or views the next item, with a continuation probability that does not depend on the relevance of the document.]

Expected Reciprocal Rank [Chapelle et al CIKM09]

[Figure: the ERR browsing model: after viewing each item the user asks "Relevant?"; depending on whether the document is not, somewhat, or highly relevant, they stop with the corresponding probability or view the next item.]

Expected Reciprocal Rank [Chapelle et al CIKM09]

• $ERR = \sum_{r=1}^{n} \varphi(r)\, P(\text{user stops at position } r)$

– $\varphi(r)$ is the utility of finding "the perfect document" at rank $r$; here $\varphi(r) = 1/r$

• $P(\text{user stops at position } r) = R_r \prod_{i=1}^{r-1} (1 - R_i)$

– $R_r = \frac{2^{g_r} - 1}{2^{g_{max}}}$ is the probability that the document at rank $r$ satisfies the user, where $g_r$ is the relevance grade of the $r$th document

• Putting these together:

  $ERR = \sum_{r=1}^{n} \frac{1}{r}\, R_r \prod_{i=1}^{r-1} (1 - R_i)$
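A sketch of ERR following the formula above, reusing the graded example from the DCG slides.

```python
def err(grades, g_max=None):
    """Expected reciprocal rank: sum_r (1/r) * R_r * prod_{i<r} (1 - R_i),
    with R_r = (2^g_r - 1) / 2^g_max."""
    if g_max is None:
        g_max = max(grades)
    not_stopped, score = 1.0, 0.0
    for r, g in enumerate(grades, start=1):
        r_prob = (2 ** g - 1) / (2 ** g_max)   # probability the user is satisfied at rank r
        score += not_stopped * r_prob / r
        not_stopped *= (1 - r_prob)
    return score

print(err([2, 1, 0, 0, 2, 1, 0, 1, 0, 0]))
```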

Metrics derived from Query Logs

• Use the query logs to understand how users behave

• Learn the parameters of the user model from the query logs

– Utility, discount, etc.

Metrics derived from Query Logs

• Users tend to stop searching when they are satisfied or frustrated
• P(observe a doc at rank r) is highly affected by snippet quality

Click and stopping probabilities by relevance grade, estimated from query logs:

Relevance   P(C|R)   P(Stop|R)
Bad         0.50     0.49
Fair        0.49     0.41
Good        0.45     0.37
Excellent   0.59     0.53
Perfect     0.79     0.76

Metrics derived from Query Logs

• Users behave differently for different queries
– Informational queries
– Navigational queries

             Navigational           Informational
Relevance    P(C|R)    P(Stop|R)    P(C|R)    P(Stop|R)
Bad          0.632     0.587        0.516     0.431
Fair         0.569     0.523        0.455     0.357
Good         0.526     0.483        0.442     0.349
Excellent    0.700     0.669        0.533     0.458
Perfect      0.809     0.786        0.557     0.502

Expected Browsing Utility (Yilmaz et al. CIKM’10)

$D_{EBU}(r) = P(E_r)\,P(C \mid R_r)$

$EBU = \sum_{r=1}^{n} D_{EBU}(r)\, R_r$
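A rough sketch of the computation suggested by the formulas above. The examination probabilities P(E_r), the click probabilities P(C|R), and the per-grade gains R_r are all treated as inputs here; how they are actually estimated in the EBU paper is not shown on these slides, so the concrete numbers below are only placeholders.

```python
def ebu(grades, p_examine, p_click_given_rel, gain):
    """EBU sketch: EBU = sum_r D_EBU(r) * R_r, with D_EBU(r) = P(E_r) * P(C | R_r).

    grades:            relevance grade label of the document at each rank
    p_examine:         P(E_r), probability the user examines rank r
    p_click_given_rel: P(C | R), click probability per relevance grade (e.g. from logs)
    gain:              R_r, the gain assigned to each relevance grade
    """
    return sum(p_e * p_click_given_rel[g] * gain[g]
               for p_e, g in zip(p_examine, grades))

p_click = {"Perfect": 0.79, "Excellent": 0.59, "Good": 0.45, "Fair": 0.49, "Bad": 0.50}
gain = {"Perfect": 1.0, "Excellent": 0.75, "Good": 0.5, "Fair": 0.25, "Bad": 0.0}  # placeholder gains
p_examine = [1.0, 0.8, 0.6, 0.5]                                                   # placeholder P(E_r)
print(ebu(["Perfect", "Good", "Bad", "Fair"], p_examine, p_click, gain))
```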

Basic User Model

1. Discount: What is the chance a user will visit a document?

– Model of the browsing behavior

2. Utility: What does the user gain by visiting a document?

– Mostly ad-hoc, no clear user model

Graded Average Precision (Robertson et al. SIGIR’10)

• One document is more useful than another
• One possible meaning:
– one document is useful to more users than another
• Hence the following:
– assume grades of relevance...
– ...but each user has a threshold relevance grade which defines a binary view
– different users have different thresholds, described by a probability distribution over users

Graded Average Precision [Robertson et al. SIGIR10]

• The user has a binary view of relevance, obtained by thresholding the relevance scale

[Figure: a relevance scale Irrelevant – Relevant – Highly Relevant. With probability g1 the threshold falls at "Relevant", so both Relevant and Highly Relevant documents are considered relevant; with probability g2 it falls at "Highly Relevant", so only Highly Relevant documents are considered relevant.]

Graded Average Precision

• Assume relevance grades {0...c}: 0 for non-relevant, plus c positive grades
• g_i = P(user threshold is at i) for i ∈ {1...c}
– i.e. the user regards grades {i...c} as relevant and grades {0...(i−1)} as not relevant
– the g_i sum to one
• Step down the ranked list, stopping at documents that may be relevant, and calculate the expected precision at each of these (expected over the population of users)

Graded Average Precision (GAP)

Example ranking (graded relevance): 1 HR, 2 R, 3 N, 4 N, 5 R, 6 HR, 7 R

• With probability g1 (threshold at R), every R and HR document counts as relevant: ranks 1, 2, 5, 6, 7 → prec@6 = 4/6
• With probability g2 (threshold at HR), only HR documents count as relevant: ranks 1, 6 → prec@6 = 2/6
• Expected precision at rank 6 over the population of users:

  $wprec_6 = \frac{4}{6}\,g_1 + \frac{2}{6}\,g_2$
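A small sketch of the expected-precision computation in the example above; the threshold probabilities g1 = 0.7 and g2 = 0.3 are arbitrary choices for illustration.

```python
def expected_precision_at_k(grades, k, threshold_probs):
    """Expected precision at rank k over the population of user thresholds.

    grades:          graded relevance of each ranked document (0 = N, 1 = R, 2 = HR)
    threshold_probs: {threshold_grade: probability}, e.g. {1: g1, 2: g2};
                     a user with threshold t treats grades >= t as relevant.
    """
    return sum(p * sum(1 for g in grades[:k] if g >= t) / k
               for t, p in threshold_probs.items())

grades = [2, 1, 0, 0, 1, 2, 1]                                 # HR R N N R HR R
print(expected_precision_at_k(grades, 6, {1: 0.7, 2: 0.3}))    # (4/6)*g1 + (2/6)*g2
```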

Probability Models

• Almost all the measures we’ve discussed are based on probabilistic models of users

– Most have one or more parameters representing something about user behavior

– Is there a way to incorporate variability in the user population?

• How do we estimate parameter values?

– Is a single point estimate good enough?

Choosing Parameter Values

• Parameter θ models a user
– Higher θ: more patience, more results viewed
– Lower θ: less patience, fewer results viewed
• Different approaches:
– Minimize variance in evaluation (Kanoulas & Aslam, CIKM ‘09)
– Use a click log; fit a model to gaps between clicks (Zhang et al., IRJ, 2010)
– All try to infer a single value for the parameters

Distribution of “Patience” for RBP

• Form a distribution P(θ)

• Sampling from P(θ) is like sampling a “user” defined by their patience

• How can we form a proper distribution of θ?

• Idea: mine logged search engine user data
– Look at the ranks users are clicking
– Estimate patience based on the absence or presence of clicks

Modeling Patience from Log Data

• We will assume a flat prior on θ that we want to update using log data L

• Decompose L into individual search sessions

– For each session q, count:

• cq, the total number of clicks

• rq, the total number of no-clicks

– Model cq with a negative binomial distribution conditional on rq and θ:

Modeling Patience from Log Data

• Marginalize P(θ|L) over r:

• Apply Bayes’ rule to P(θ | r, L):

• P(L | θ, r) is the likelihood of the observed clicks

Complete Model Expression

• Model components result in three equations to estimate P(θ | L)

Empirical Patience Profiles: Navigational Queries

Empirical Patience Profiles: Informational Queries

Extend to ERR Parameters

Evaluation Using Parameter Distributions

• Monte Carlo procedure:

– Sample a parameter value from P(θ | L)

• Or a vector of values for ERR

– Compute the measure with the sampled value

– Iterate to form distribution P(RBP) or P(ERR)
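A sketch of this Monte Carlo procedure for RBP. The Beta distribution used to sample θ is only a stand-in for the fitted posterior P(θ | L); the two systems are S1 and S2 from the next slide.

```python
import random

def rbp(rels, theta):
    return (1 - theta) * sum(theta ** i * rel for i, rel in enumerate(rels))

# Stand-in samples for P(theta | L); in practice these come from the posterior
# fitted to click-log data (Beta(5, 2) is chosen here only for illustration).
theta_samples = [random.betavariate(5, 2) for _ in range(10000)]

s1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]        # S1 = [R N N N N N N N N N]
s2 = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]        # S2 = [N R R R R R R R R R]
d1 = [rbp(s1, t) for t in theta_samples]   # distribution of RBP for S1
d2 = [rbp(s2, t) for t in theta_samples]   # distribution of RBP for S2

# Marginal distribution analysis: P(M1 > M2) over the sampled user population
print(sum(a > b for a, b in zip(d1, d2)) / len(theta_samples))
```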

Marginal Distribution Analysis

• S1=[R N N N N N N N N N]

• S2=[N R R R R R R R R R]

Distribution of RBP

Distribution of ERR

Marginal Distribution Analysis

• Given two systems, over all choices of θ

– What is P(M1 > M2)?

– What is P((M1 - M2)>t)?

Marginal Distribution Analysis

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Why sessions?

• The current evaluation framework
– assesses the effectiveness of systems over one-shot queries
• But users reformulate their initial query
• Still fine if...
– optimizing the system for one-shot queries led to optimal performance over an entire session

When was the DuPont Science Essay Contest created?
Initial query: DuPont Science Essay Contest
Reformulation: When was the DSEC created?

• e.g. retrieval systems should accumulate information along a session

Why sessions?

Paris Luxurious Hotels Paris Hilton J Lo Paris

Extend the evaluation framework

From single-query evaluation to multi-query session evaluation:
• Construct appropriate test collections
• Rethink evaluation measures

Basic test collection

• A set of information needs, e.g.: "A friend from Kenya is visiting you and you'd like to surprise him by cooking a traditional Swahili dish. You would like to search online to decide which dish you will cook at home."
• A static sequence of m queries per information need:
– Initial query: kenya cooking traditional
– 1st reformulation: kenya cooking traditional swahili
– 2nd reformulation: www.allrecipes.com
– ...
– (m−1)th reformulation: kenya swahili traditional food recipes

Basic Test Collection

Factual/Amorphous, Known-item search

Intellectual/Amorphous, Explanatory search

Factual/Amorphous, Known-item search

Experiment

[Figure: ranked result lists (positions 1–10) retrieved for each query in the session: "kenya cooking traditional", "kenya cooking traditional swahili", and "kenya swahili traditional food recipes".]

Construct appropriate test collections

Rethink evaluation measures

What is a good system?

How can we measure “goodness”?

Measuring “goodness”

The user steps down a ranked list of documents and observes each one of them until a decision point, where they either
a) abandon the search, or
b) reformulate.

While stepping down or sideways, the user accumulates utility.

What are the challenges?

Evaluation over a single ranked list

[Figure: a single ranked list (positions 1–10) considered against the session's queries: "kenya cooking traditional swahili", "kenya cooking traditional", "kenya swahili traditional food recipes".]

Session DCG [Järvelin et al ECIR 2008]

For each query in the session (e.g. "kenya cooking traditional swahili", then "kenya cooking traditional"), compute a DCG over its ranked list:

  $DCG(RL_i) = \sum_{r=1}^{k} \frac{2^{rel(r)} - 1}{\log_b(r + b - 1)}$

Then discount each query's DCG by its position in the session:

  $sDCG = \frac{1}{\log_c(1 + c - 1)}\,DCG(RL_1) + \frac{1}{\log_c(2 + c - 1)}\,DCG(RL_2) + \dots$
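A sketch of session DCG as reconstructed above; the session's relevance grades and the parameter values b = 2 and c = 4 are illustrative assumptions.

```python
import math

def dcg_list(grades, b=2):
    """Per-query DCG: sum_r (2^rel(r) - 1) / log_b(r + b - 1)."""
    return sum((2 ** g - 1) / math.log(r + b - 1, b) for r, g in enumerate(grades, start=1))

def session_dcg(session, b=2, c=4):
    """Session DCG: the i-th query's DCG is discounted by 1 / log_c(i + c - 1)."""
    return sum(dcg_list(grades, b) / math.log(i + c - 1, c)
               for i, grades in enumerate(session, start=1))

# Three ranked lists in a session with made-up graded judgments (0/1/2)
session = [[0, 1, 0, 2], [1, 1, 0, 0], [2, 1, 1, 0]]
print(session_dcg(session))
```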

Session Metrics

• Session DCG [Järvelin et al ECIR 2008]

The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]

Model-based measures

Probabilistic space of users following

different paths

• Ω is the space of all paths

• P(ω) is the prob of a user following a path ω in Ω

• U(ω) is the utility of path ω in Ω

[Yang and Lad ICTIR 2009]

$\sum_{\omega \in \Omega} P(\omega)\,U(\omega)$

Expected Global Utility [Yang and Lad ICTIR 2009]

1. User steps down ranked results one-by-one

2. Stops browsing documents based on a stochastic process that defines a stopping probability distribution over ranks and reformulates

3. Gains something from relevant documents, accumulating utility

Expected Global Utility [Yang and Lad ICTIR 2009]

• The probability of a user following a path ω:

P(ω) = P(r1, r2, ..., rK)

ri is the stopping and reformulation point in list i

– Assumption: stopping positions in each list are independent

P(r1, r2, ..., rK) = P(r1)P(r2)...P(rK)

– Use a geometric distribution (as in RBP) to model the stopping and reformulation behaviour:

  $P(r_i = r) = (1-\theta)\,\theta^{\,r-1}$

[Figure: three ranked lists Q1, Q2, Q3 in a session, with relevance judgments (N/R) at each rank; a geometric distribution with parameter θ governs the rank at which the user stops in each list.]

Expected Global Utility

Session Metrics

• Session DCG [Järvelin et al ECIR 2008]

The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]

• Expected global utility [Yang and Lad ICTIR 2009]

The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]

Model-based measures

Probabilistic space of users following

different paths

• Ω is the space of all paths

• P(ω) is the prob of a user following a path ω in Ω

• Mω is a measure over a path ω

[Kanoulas et al. SIGIR 2011]

$esM = \sum_{\omega \in \Omega} P(\omega)\,M_\omega$

Probability of a path

• Example: P(path) = P(abandoning the session at reformulation 2) × P(reformulating at rank 3)

[Figure: a session of ranked lists Q1, Q2, Q3 with relevance judgments (N/R). The probability of abandoning the session at reformulation i is modeled by a truncated geometric distribution with parameter p_reform; the probability of reformulating at rank j within a list is modeled by a geometric distribution with parameter p_down.]

Session Metrics

• Session DCG [Järvelin et al ECIR 2008]

The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]

• Expected global utility [Yang and Lad ICTIR 2009]

The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]

• Expected session measures [Kanoulas et al. SIGIR 2011]

The user steps down a ranked list of documents until a decision point and either abandons the query or reformulates [Stochastic; allows early abandonment]

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Novelty

• The redundancy problem:

– the first relevant document contains some useful information

– every document with the same information after that is worth less to the user

• but worth the same to traditional evaluation measures

• Novelty retrieval attempts to ensure that ranked results do not have much redundancy

Example

• Query: “oil-producing nations”
– members of OPEC
– North Atlantic nations
– South American nations
• 10 relevant articles about OPEC are probably not as useful as one relevant article about each group
– And one relevant article about all oil-producing nations might be even better

How to Evaluate?

• One approach:

– List subtopics, aspects, or facets of the topic

– Judge each document relevant or not to each possible subtopic

• For oil-producing nations, subtopics could be names of nations

– Saudi Arabia, Russia, Canada, …

Subtopic Relevance Example

Evaluation Measures

• Subtopic recall and precision (Zhai et al., 2003)
– Subtopic recall at rank k:
• Count the unique subtopics in the top k documents
• Divide by the total number of known unique subtopics
– Subtopic precision at recall r:
• Find the least k at which subtopic recall r is achieved
• Find the least k at which subtopic recall r could possibly be achieved (by a perfect system)
• Divide the latter by the former
– Models a user that wants all subtopics
• and doesn’t care about redundancy as long as they are seeing new information
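A small sketch of subtopic recall at rank k as defined above; the subtopic sets are hypothetical.

```python
def subtopic_recall_at_k(doc_subtopics, k, all_subtopics):
    """Unique subtopics covered in the top-k documents, divided by the total
    number of known unique subtopics. `doc_subtopics` is a list (by rank) of
    sets of subtopic identifiers."""
    covered = set()
    for subtopics in doc_subtopics[:k]:
        covered |= subtopics
    return len(covered & all_subtopics) / len(all_subtopics)

all_subtopics = {"saudi arabia", "russia", "canada"}
ranking = [{"saudi arabia"}, {"saudi arabia"}, {"russia"}, set(), {"canada"}]
print(subtopic_recall_at_k(ranking, 3, all_subtopics))  # 2/3
```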

Subtopic Relevance Evaluation


Diversity

• Short keyword queries are inherently ambiguous

– An automatic system can never know the user’s intent

• Diversification attempts to retrieve results that may be relevant to a space of possible intents

Evaluation Measures

• Subtopic recall and precision

– This time with judgments to “intents” rather than subtopics

• Measures that know about intents:

– “Intent-aware” family of measures (Agrawal et al.)

– D, D♯ measures (Sakai et al.)

– α-nDCG (Clarke et al.)

– ERR-IA (Chapelle et al.)

Intent-Aware Measures

• Assume there is a probability distribution P(i | Q) over intents for a query Q

– Probability that a randomly-sampled user means intent i when submitting query Q

• The intent-aware version of a measure is its weighted average over this distribution

Example: P@10-IA = 0.35·0.3 + 0.35·0.3 + 0.2·0.2 + 0.08·0.1 + 0.02·0.1 = 0.26
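A one-function sketch of the intent-aware weighting above, using the example's numbers.

```python
def intent_aware(per_intent_scores, intent_probs):
    """Intent-aware version of a measure: the average of the measure computed
    per intent, weighted by P(intent | query)."""
    return sum(p * s for p, s in zip(intent_probs, per_intent_scores))

# Per-intent P@10 values and intent probabilities from the example above
print(intent_aware([0.3, 0.3, 0.2, 0.1, 0.1], [0.35, 0.35, 0.2, 0.08, 0.02]))  # ~0.26
```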

D-measure

• Take the idea of intent-awareness and apply it to computing document gain

– The gain for a document is the (weighted) average of its gains for subtopics it is relevant to

• D-nDCG is nDCG computed using intent-aware gains

D-DCG = 0.35/log 2 + 0.35/log 3 + …

α-nDCG

• α-nDCG is a generalization of nDCG that accounts for both novelty and diversity

• α is a geometric penalization for redundancy

– Redefine the gain of a document:

• +1 for each subtopic it is relevant to

• ×(1-α) for each document higher in the ranking that subtopic already appeared in

• Discount is the same as usual

[Figure: example of α-nDCG gain contributions: a subtopic contributes +1 the first time it appears in the ranking, +(1−α) the second time, +(1−α)² the third time, and so on.]
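A sketch of the α-nDCG gain computation described above (the normalisation by the ideal ranking is omitted); the subtopic sets are hypothetical.

```python
import math
from collections import defaultdict

def alpha_dcg(doc_subtopics, alpha=0.5):
    """For each subtopic a document covers, it gains (1 - alpha)^(times that
    subtopic already appeared higher in the ranking); the rank discount is
    the usual 1 / log2(r + 1). alpha-nDCG divides this by the ideal value."""
    seen = defaultdict(int)
    score = 0.0
    for r, subtopics in enumerate(doc_subtopics, start=1):
        gain = sum((1 - alpha) ** seen[s] for s in subtopics)
        score += gain / math.log2(r + 1)
        for s in subtopics:
            seen[s] += 1
    return score

print(alpha_dcg([{1, 2}, {1}, {3}, {2, 3}], alpha=0.5))
```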

ERR-IA

• Intent-aware version of ERR

• But it has appealing properties other IA measures do not have:

– ranges between 0 and 1

– submodularity: diminishing returns for relevance to a given subtopic -> built-in redundancy penalization

• Also has appealing properties over α-nDCG:

– Easily handles graded subtopic judgments

– Easily handles intent distributions

Granularity of Judging

• What exactly is a “subtopic”?

– Perhaps any piece of information a user may be interested in finding?

• At what granularity should subtopics be defined?

– For example:

• “cardinals” has many possible meanings

• “cardinals baseball team” is still very broad

• “cardinals baseball team schedule” covers 6 months

• “cardinals baseball team schedule august” covers ~25 games

• “cardinals baseball team schedule august 12th”

Preference Judgments for Novelty

• What about evaluating novelty with no subtopic judgments?

• Preference judgments:

– Is document A more relevant than document B?

• Conditional preference judgments:

– Is document A better than document B given that I’ve just seen document C?

– Assumption: preference is based on novelty over C

• Is it true? Come to our presentation on Wednesday…

Conclusions

• Strong interest in using evaluation measures to model user behavior and satisfaction
– Driven by the availability of user logs, increased computational power, and good abstract models
– DCG, RBP, ERR, EBU, session measures, and diversity measures all model users in different ways

• Cranfield-style evaluation is still important!

• But there is still much to understand about users and how they derive satisfaction

Conclusions

• Ongoing and future work:
– Models with more degrees of freedom
– Direct simulation of users from start of session to finish
– Application to other domains

• Thank you!
– Slides will be available online
– http://ir.cis.udel.edu/SIGIR12tutorial