Advances on the Development of Evaluation Measures


Transcript of Advances on the Development of Evaluation Measures

Page 1: Advances on the Development of Evaluation Measures

Advances on the Development of Evaluation Measures

Ben Carterette Evangelos Kanoulas Emine Yilmaz

Page 2: Advances on the Development of Evaluation Measures

Information Retrieval Systems

Match information seekers with

the information they seek

Page 3: Advances on the Development of Evaluation Measures

Why is Evaluation so Important?

“What you can’t measure you can’t improve” — Lord Kelvin

Most retrieval systems are tuned to optimize for an objective evaluation metric.

Page 4: Advances on the Development of Evaluation Measures

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Page 5: Advances on the Development of Evaluation Measures

Online Evaluation

• Design interactive experiments

• Use users’ actions (click / no click) to evaluate result quality

Page 6: Advances on the Development of Evaluation Measures

Online Evaluation

• Standard click metrics
  – Clickthrough rate
  – Queries per user
  – Probability that a user skips over results they have considered (pSkip)

• Result interleaving

Page 7: Advances on the Development of Evaluation Measures

What is result interleaving?

• A way to compare rankers online
  – Given the two rankings produced by two methods
  – Present a combination of the rankings to users

• Result interleaving
  – Credit assignment based on clicks

Page 8: Advances on the Development of Evaluation Measures

Team Draft Interleaving (Radlinski et al., 2008)

• Interleaving two rankings

– Input: Two rankings

  – Repeat:
    • Toss a coin to see which team picks next
    • The winner of the toss picks its best remaining result first
    • The other team then picks its best remaining result
  – Output: one combined ranking

• Credit assignment
  – The ranking that provided more of the clicked results wins (see the sketch below)
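A minimal sketch of team-draft interleaving and its click-based credit assignment (Python). Function names and the tie-breaking details are our own illustration, not code from Radlinski et al.

```python
import random

def next_unused(pool, used):
    """Pop the highest-ranked result from pool that has not been shown yet."""
    while pool:
        doc = pool.pop(0)
        if doc not in used:
            return doc
    return None

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Team-draft interleaving: the team with fewer picks so far contributes
    its best remaining result; a coin toss breaks ties.  Returns the combined
    ranking and the contributing team per position (for credit assignment)."""
    a, b = list(ranking_a), list(ranking_b)          # work on copies
    interleaved, teams, used = [], [], set()
    picks = {"A": 0, "B": 0}
    while len(interleaved) < length and (a or b):
        if a and (not b or picks["A"] < picks["B"]
                  or (picks["A"] == picks["B"] and random.random() < 0.5)):
            team, pool = "A", a
        else:
            team, pool = "B", b
        doc = next_unused(pool, used)
        if doc is None:
            continue                                 # this team is exhausted
        interleaved.append(doc)
        teams.append(team)
        used.add(doc)
        picks[team] += 1
    return interleaved, teams

def credit(teams, clicked_positions):
    """The ranking that contributed more of the clicked results wins."""
    votes = {"A": 0, "B": 0}
    for pos in clicked_positions:                    # 0-based clicked positions
        votes[teams[pos]] += 1
    return votes

presented, teams = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d5"], length=5)
print(credit(teams, clicked_positions=[0, 1]))
```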

Page 9: Advances on the Development of Evaluation Measures

Team Draft Interleaving

Ranking A
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa County, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B
1. Napa County, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented (interleaved) ranking
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa County, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
6. Napa Valley College (www.napavalley.edu/homex.asp)
7. NapaValley.org (www.napavalley.org)

(Each result in the presented ranking is credited to the team, A or B, that contributed it.)

Page 10: Advances on the Development of Evaluation Measures

Team Draft Interleaving (continued)

[Same Ranking A, Ranking B, and presented interleaved ranking as on the previous slide; the user's clicks land mostly on results contributed by Ranking B.]

B wins!

Page 11: Advances on the Development of Evaluation Measures

Offline Evaluation

• Controlled laboratory experiments

• The user’s interaction with the engine is only simulated
  – Ask experts to judge each query result
  – Predict how users behave when they search
  – Aggregate judgments to evaluate

Page 12: Advances on the Development of Evaluation Measures

Offline Evaluation

• Ask experts to judge each query result

• Predict how users behave when they search

• Aggregate judgments to evaluate


Page 13: Advances on the Development of Evaluation Measures

Online vs. Offline Evaluation

• Online
  – Pros: cheap; measures actual user reactions
  – Cons: need to go live; noisy; slow; not duplicable

• Offline
  – Pros: fast to evaluate; easy to try new ideas; portable
  – Cons: needs ground truth; judgments are slow to obtain ("expensive") and can be "inconsistent"; difficult to model how users behave

Page 14: Advances on the Development of Evaluation Measures

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Page 15: Advances on the Development of Evaluation Measures

Traditional Experiment

[Figure: search engines return results; judges assess them — "How many good docs have I missed/found?"]

Page 16: Advances on the Development of Evaluation Measures

Depth-k Pooling

[Figure: the top k documents (ranks 1 ... k) from each system's ranked list (sys1, sys2, sys3, ..., sysM) form the pool — "Judge Documents"]

Page 17: Advances on the Development of Evaluation Measures

Depth-k Pooling

[Figure: the same pooled ranked lists, now handed to the judge]

Page 18: Advances on the Development of Evaluation Measures

Depth-k Pooling

[Figure: the judgments come back — pooled documents are marked relevant (R) or non-relevant (N), while some documents remain unjudged (?)]

Page 19: Advances on the Development of Evaluation Measures

Depth-k Pooling

[Figure: the same lists with the unjudged documents (?) treated as non-relevant (N)]

Page 20: Advances on the Development of Evaluation Measures

Reusable Test Collections

• Document Corpus

• Topics

• Relevance Judgments


Page 21: Advances on the Development of Evaluation Measures

Evaluation Metrics: Precision vs Recall

Retrieved list (ranks 1-10, continuing below rank 10): R, N, R, N, N, R, N, N, N, R, ...

• Precision@k: the fraction of the top k retrieved documents that are relevant
• Recall@k: the fraction of all relevant documents in the collection that appear in the top k
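A small sketch of the two measures on the list above (Python). The total number of relevant documents in the collection is not given on the slide, so the value 8 below is purely illustrative.

```python
def precision_at_k(rels, k):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(rels[:k]) / total_relevant

rels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]               # R N R N N R N N N R
print(precision_at_k(rels, 5))                       # 2/5 = 0.4
print(precision_at_k(rels, 10))                      # 4/10 = 0.4
print(recall_at_k(rels, 10, total_relevant=8))       # 4/8 = 0.5 (8 is an assumed total)
```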

Page 22: Advances on the Development of Evaluation Measures

Visualizing Retrieval Performance: Precision-Recall Curves

List: R, N, R, N, N, R, N, N, N, R

[Figure: the precision-recall curve corresponding to this list]

Page 23: Advances on the Development of Evaluation Measures

Evaluation Metrics: Average Precision

List: R, N, R, N, N, R, N, N, N, R

Average precision averages the precision at each relevant document: here the relevant documents appear at ranks 1, 3, 6, and 10, contributing precisions 1/1, 2/3, 3/6, and 4/10; the sum is divided by the total number of relevant documents for the topic.

Page 24: Advances on the Development of Evaluation Measures

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Page 25: Advances on the Development of Evaluation Measures

User models Behind Traditional Metrics

• Precision@k

– Users always look at top k documents

– What fraction of the top k documents are relevant?

• Recall

– Users would like to find all the relevant documents.

– What fraction of these documents have been retrieved by the search engine?

Page 26: Advances on the Development of Evaluation Measures

User Model of Average Precision (Robertson ‘08)

1. User steps down a ranked list one-by-one

2. Stops browsing documents due to satisfaction

– stops with a certain probability after observing a relevant document

3. Gains utility from each relevant document

Page 27: Advances on the Development of Evaluation Measures

User Model of Average Precision (Robertson ‘08)

• The probability that the user stops browsing is uniform over the relevant documents (R is the total number of relevant documents):

  P(n) = 1/R if the document at rank n is relevant, 0 otherwise

• The utility the user gains when stopping at a relevant document at rank n is the precision at rank n:

  U(n) = \frac{1}{n} \sum_{k=1}^{n} rel(k)

• AP can be written as:

  AP = \sum_{n} P(n) \, U(n)
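A sketch of AP computed exactly as this user model prescribes (Python). Only for the example, the four relevant documents shown are assumed to be all the relevant documents for the topic.

```python
def average_precision(rels, total_relevant=None):
    """AP = sum_n P(n) * U(n), with P(n) = 1/R for relevant documents
    (0 otherwise) and U(n) = precision at rank n."""
    R = total_relevant if total_relevant is not None else sum(rels)
    ap, relevant_seen = 0.0, 0
    for n, rel in enumerate(rels, start=1):
        if rel:
            relevant_seen += 1
            ap += (1.0 / R) * (relevant_seen / n)    # P(n) * U(n)
    return ap

# The list from the earlier slide: R N R N N R N N N R
print(average_precision([1, 0, 1, 0, 0, 1, 0, 0, 0, 1]))   # (1 + 2/3 + 3/6 + 4/10) / 4 ≈ 0.642
```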

Page 28: Advances on the Development of Evaluation Measures

User Model Based Evaluation Measures

• Directly aim at evaluating user satisfaction

– An effectiveness measure should be correlated to the user’s experience

• Thus interest in effectiveness measures based on explicit models of user interaction

– Devise a user model correlated with user behavior

– Infer an evaluation metric from the user model

Page 29: Advances on the Development of Evaluation Measures

Basic User Model

• Simple model of user interaction:

1. User steps down ranked results one-by-one

2. Stops at a document at rank k with some probability P(k)

3. Gains some utility U(k) from relevant documents

  M = \sum_{k=1}^{n} U(k) \, P(k)
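The measures discussed next are instances of this template. A generic sketch (Python; the function and argument names are ours), with Precision@10 shown as one instantiation:

```python
def user_model_measure(rels, stop_prob, utility):
    """M = sum_k U(k) * P(k): expected utility when the user stops at rank k
    with probability stop_prob(k) and gains utility(k) at that point."""
    return sum(utility(k, rels) * stop_prob(k, rels)
               for k in range(1, len(rels) + 1))

# Precision@10 as an instance: the user always stops at rank 10,
# and the utility there is the fraction of relevant documents seen.
rels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
p_at_10 = user_model_measure(
    rels,
    stop_prob=lambda k, r: 1.0 if k == 10 else 0.0,
    utility=lambda k, r: sum(r[:k]) / k,
)
print(p_at_10)    # 0.4
```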

Page 30: Advances on the Development of Evaluation Measures

Basic User Model

1. Discount: What is the chance a user will visit a document?

– Model of the browsing behavior

2. Utility: What does the user gain by visiting a document?

Page 31: Advances on the Development of Evaluation Measures

Model Browsing Behavior

Position-based models

The chance of observing a document depends on the position at which it is presented in the ranked list.

[Figure: a ranked list of 10 results for the query "black powder ammunition"]

Page 32: Advances on the Development of Evaluation Measures

Rank Biased Precision

[Figure: the RBP browsing model over a 10-result list for the query "black powder ammunition" — after issuing the query the user views an item, then repeatedly either stops or views the next item]

Page 33: Advances on the Development of Evaluation Measures

Rank Biased Precision

[Figure: results 1-10 for the query "black powder ammunition"]

  RBP = (1 - \theta) \sum_{i=1}^{\infty} rel_i \, \theta^{i-1}
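A minimal sketch of RBP (Python); θ = 0.8 is just an illustrative choice of the persistence parameter.

```python
def rbp(rels, theta=0.8):
    """RBP = (1 - theta) * sum_i rel_i * theta**(i - 1)."""
    return (1 - theta) * sum(rel * theta ** (i - 1)
                             for i, rel in enumerate(rels, start=1))

print(rbp([1, 0, 1, 0, 0, 1, 0, 0, 0, 1], theta=0.8))   # ≈ 0.42
```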

Page 34: Advances on the Development of Evaluation Measures

Discounted Cumulative Gain

[Figure: results 1-10 for the query "black powder ammunition" with graded judgments]

Rank   Judgment   Relevance score   Gain (2^rel - 1)   Discounted gain (gain / log2(r+1))
  1      HR            2                  3                  3
  2      R             1                  1                  0.63
  3      N             0                  0                  0
  4      N             0                  0                  0
  5      HR            2                  3                  1.16
  6      R             1                  1                  0.35
  7      N             0                  0                  0
  8      R             1                  1                  0.31
  9      N             0                  0                  0
 10      N             0                  0                  0

DCG = 3 + 0.63 + 1.16 + 0.35 + 0.31 = 5.46

Page 35: Advances on the Development of Evaluation Measures

Discounted Cumulative Gain

• DCG can be written as:

• Discount function models the probability that the user visits (clicks on) the document at rank r

– Currently, P(user clicks on doc r) = 1/log2(r+1)

  DCG = \sum_{r=1}^{N} P(\text{user visits doc at rank } r) \cdot Utility(r)

Page 36: Advances on the Development of Evaluation Measures

Discounted Cumulative Gain

• Instead of stopping probability, think about viewing probability

• This fits into the discounted gain model framework shown above

Page 37: Advances on the Development of Evaluation Measures

Normalised Discounted Cumulative Gain

[Same example ranking, gains, and discounted gains as on the previous slide]

  NDCG = DCG / optDCG

where optDCG is the DCG of the optimal (ideal) reordering of the judged documents.
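A sketch of DCG and nDCG using exactly the gain and discount above (Python). Here the ideal ranking is obtained by sorting the judged grades, which assumes the judged documents are all we normalize against.

```python
import math

def dcg(grades):
    """DCG with gain 2^rel - 1 and discount 1/log2(r + 1)."""
    return sum((2 ** g - 1) / math.log2(r + 1)
               for r, g in enumerate(grades, start=1))

def ndcg(grades):
    """nDCG = DCG / optDCG, where optDCG reorders the same grades ideally."""
    opt = dcg(sorted(grades, reverse=True))
    return dcg(grades) / opt if opt > 0 else 0.0

grades = [2, 1, 0, 0, 2, 1, 0, 1, 0, 0]    # the example ranking above
print(round(dcg(grades), 2))               # 5.46
print(round(ndcg(grades), 3))
```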

Page 38: Advances on the Development of Evaluation Measures

Model Browsing Behavior

Cascade-based models

[Figure: results 1-10 for the query "black powder ammunition"]

• The user views search results from top to bottom
• At each rank i, the user has a certain probability of being satisfied
  – The probability of satisfaction is proportional to the relevance grade of the document at rank i
• Once the user is satisfied with a document, he terminates the search

Page 39: Advances on the Development of Evaluation Measures

Rank Biased Precision

[Figure: the RBP model again — query, then repeatedly "view next item" or "stop" — shown over results 1-10 for "black powder ammunition"]

Page 40: Advances on the Development of Evaluation Measures

Expected Reciprocal Rank [Chapelle et al CIKM09]

[Figure: the cascade model behind ERR over results 1-10 for "black powder ammunition" — after viewing each item the user asks "Relevant?" (no / somewhat / highly) and either stops or views the next item]

Page 41: Advances on the Development of Evaluation Measures

Expected Reciprocal Rank [Chapelle et al CIKM09]

[Figure: results 1-10 for the query "black powder ammunition"]

• φ(r) = 1/r is the utility of finding "the perfect document" at rank r

  ERR = \sum_{r=1}^{n} φ(r) \cdot P(\text{user stops at position } r)
      = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i)

• R_r = P(user stops at position r | the user reaches it) = \frac{2^{g_r} - 1}{2^{g_{max}}}

  where g_r is the relevance grade of the r-th document
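A sketch of ERR as defined above (Python); g_max = 2 corresponds to the three-grade scale used in the running example.

```python
def err(grades, g_max=2):
    """ERR = sum_r (1/r) * R_r * prod_{i<r} (1 - R_i),
    with R_r = (2**g_r - 1) / 2**g_max."""
    value, p_reach = 0.0, 1.0          # p_reach = probability of reaching rank r
    for r, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / 2 ** g_max
        value += p_reach * stop / r
        p_reach *= 1 - stop
    return value

print(err([2, 1, 0, 0, 2, 1, 0, 1, 0, 0], g_max=2))   # grades from the DCG example
```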

Page 42: Advances on the Development of Evaluation Measures

Metrics derived from Query Logs

• Use the query logs to understand how users behave

• Learn the parameters of the user model from the query logs

– Utility, discount, etc.

Page 43: Advances on the Development of Evaluation Measures

Metrics derived from Query Logs

• Users tend to stop search if they are satisfied or frustrated

• P(observe a doc at rank r) highly affected by snippet quality

Relevance    P(C|R)    P(Stop|R)
Bad          0.50      0.49
Fair         0.49      0.41
Good         0.45      0.37
Excellent    0.59      0.53
Perfect      0.79      0.76

Page 44: Advances on the Development of Evaluation Measures

Metrics derived from Query Logs

• Users behave differently for different queries

– Informational queries

– Navigational queries

             Navigational            Informational
Relevance    P(C|R)    P(Stop|R)     P(C|R)    P(Stop|R)
Bad          0.632     0.587         0.516     0.431
Fair         0.569     0.523         0.455     0.357
Good         0.526     0.483         0.442     0.349
Excellent    0.700     0.669         0.533     0.458
Perfect      0.809     0.786         0.557     0.502

Page 45: Advances on the Development of Evaluation Measures

Expected Browsing Utility (Yilmaz et al. CIKM'10)

  DEBU(r) = P(E_r) \cdot P(C \mid R_r)

  EBU = \sum_{r=1}^{n} DEBU(r) \cdot R_r

where P(E_r) is the probability that the user examines the document at rank r, P(C | R_r) is the probability of a click given its relevance grade, and R_r is its relevance.

Page 46: Advances on the Development of Evaluation Measures

Basic User Model

1. Discount: What is the chance a user will visit a document?

– Model of the browsing behavior

2. Utility: What does the user gain by visiting a document?

– Mostly ad-hoc, no clear user model

Page 47: Advances on the Development of Evaluation Measures

Graded Average Precision (Robertson et al. SIGIR’10)

• One document can be more useful than another
• One possible meaning:
  – one document is useful to more users than another
• Hence the following model:
  – assume grades of relevance...
  – ...but each user has a threshold relevance grade, which defines a binary view of relevance
  – different users have different thresholds, described by a probability distribution over users

Page 48: Advances on the Development of Evaluation Measures

Graded Average Precision [Robertson et al. SIGIR10]

• User has binary view of relevance

– by thresholding the relevance scale

[Figure: the relevance scale (Irrelevant, Relevant, Highly Relevant) with the user's threshold drawn at grade "Relevant" — this threshold occurs with probability g1]

Page 49: Advances on the Development of Evaluation Measures

Graded Average Precision [Robertson et al. SIGIR10]

• User has binary view of relevance

– by thresholding the relevance scale

[Figure: the same relevance scale with the threshold drawn at grade "Highly Relevant" — this threshold occurs with probability g2]

Page 50: Advances on the Development of Evaluation Measures

Graded Average Precision

• Assume relevance grades {0, ..., c}
  – 0 for non-relevant, plus c positive grades
• g_i = P(user threshold is at grade i), for i ∈ {1, ..., c}
  – i.e., the user regards grades {i, ..., c} as relevant and grades {0, ..., i-1} as not relevant
  – the g_i sum to one
• Step down the ranked list, stopping at documents that may be relevant
  – then calculate the expected precision at each of these stopping points (expectation taken over the population of users)

Page 51: Advances on the Development of Evaluation Measures

Graded Average Precision (GAP)

Relevance of the ranked documents: 1 HR, 2 R, 3 N, 4 N, 5 R, 6 HR, 7 R

Page 52: Advances on the Development of Evaluation Measures

Graded Average Precision (GAP)

Relevance:                       1 HR, 2 R, 3 N, 4 N, 5 R, 6 HR, 7 R

With probability g1 (threshold at grade "Relevant") the user sees:
                                 1 Rel, 2 Rel, 3 N, 4 N, 5 Rel, 6 Rel, 7 Rel

  prec_6 = 4/6

Page 53: Advances on the Development of Evaluation Measures

Graded Average Precision (GAP)

Relevance:                       1 HR, 2 R, 3 N, 4 N, 5 R, 6 HR, 7 R

With probability g2 (threshold at grade "Highly Relevant") the user sees:
                                 1 Rel, 2 N, 3 N, 4 N, 5 N, 6 Rel, 7 N

  prec_6 = 2/6

Page 54: Advances on the Development of Evaluation Measures

Graded Average Precision (GAP)

Relevance: 1 HR, 2 R, 3 N, 4 N, 5 R, 6 HR, 7 R

Expected precision at the stopping point at rank 6, averaged over the user population:

  wprec_6 = \frac{4}{6} g_1 + \frac{2}{6} g_2
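One way to read GAP is as an expectation over a population of users with different relevance thresholds. The sketch below (Python) enumerates the thresholds, binarizes the judgments, and averages AP weighted by g_i; it illustrates the idea, though the exact normalization in Robertson et al.'s formula may differ in detail. The grade encoding (HR=2, R=1, N=0) and the values of g1 and g2 are assumptions made only for the example.

```python
def binarize(grades, threshold):
    """A user with this threshold treats grades >= threshold as relevant."""
    return [1 if g >= threshold else 0 for g in grades]

def ap(rels):
    """Binary average precision."""
    R, hits, total = sum(rels), 0, 0.0
    for n, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / n
    return total / R if R else 0.0

def expected_ap_over_thresholds(grades, g):
    """Average AP over the user population: threshold i occurs with probability g[i]."""
    return sum(g_i * ap(binarize(grades, i)) for i, g_i in g.items())

# Example ranking from the slides: HR R N N R HR R  (HR=2, R=1, N=0)
grades = [2, 1, 0, 0, 1, 2, 1]
print(expected_ap_over_thresholds(grades, {1: 0.7, 2: 0.3}))   # g1, g2 are illustrative
```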

Page 55: Advances on the Development of Evaluation Measures

Probability Models

• Almost all the measures we’ve discussed are based on probabilistic models of users

– Most have one or more parameters representing something about user behavior

– Is there a way to incorporate variability in the user population?

• How do we estimate parameter values?

– Is a single point estimate good enough?

Page 56: Advances on the Development of Evaluation Measures

Choosing Parameter Values

• Parameter θ models a user
  – Higher θ: more patience, more results viewed
  – Lower θ: less patience, fewer results viewed
• Different approaches:
  – Minimize variance in evaluation (Kanoulas & Aslam, CIKM '09)
  – Use click logs; fit a model to the gaps between clicks (Zhang et al., IRJ, 2010)
  – All try to infer a single value for the parameters

Page 57: Advances on the Development of Evaluation Measures

Distribution of “Patience” for RBP

• Form a distribution P(θ)

• Sampling from P(θ) is like sampling a “user” defined by their patience

• How can we form a proper distribution of θ?

• Idea: mine logged search engine user data
  – Look at the ranks at which users are clicking
  – Estimate patience based on the absence or presence of clicks

Page 58: Advances on the Development of Evaluation Measures

Modeling Patience from Log Data

• We will assume a flat prior over θ, which we want to update using log data L

• Decompose L into individual search sessions

– For each session q, count:

• cq, the total number of clicks

• rq, the total number of no-clicks

– Model cq with a negative binomial distribution conditional on rq and θ:

Page 59: Advances on the Development of Evaluation Measures

Modeling Patience from Log Data

• Marginalize P(θ|L) over r:

• Apply Bayes’ rule to P(θ | r, L):

• P(L | θ, r) is the likelihood of the observed clicks

Page 60: Advances on the Development of Evaluation Measures

Complete Model Expression

• Model components result in three equations to estimate P(θ | L)

Page 61: Advances on the Development of Evaluation Measures

Empirical Patience Profiles: Navigational Queries

Page 62: Advances on the Development of Evaluation Measures

Empirical Patience Profiles: Informational Queries

Page 63: Advances on the Development of Evaluation Measures

Extend to ERR Parameters

Page 64: Advances on the Development of Evaluation Measures

Evaluation Using Parameter Distributions

• Monte Carlo procedure:
  – Sample a parameter value from P(θ | L)
    • or a vector of values for ERR
  – Compute the measure with the sampled value
  – Iterate to form the distribution P(RBP) or P(ERR) (see the sketch below)
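A sketch of the Monte Carlo procedure and of the marginal distribution analysis on the next slides (Python). The Beta(8, 2) posterior over θ stands in for the distribution mined from logs; it is an assumption chosen only so the example runs.

```python
import random

def rbp(rels, theta):
    return (1 - theta) * sum(r * theta ** (i - 1) for i, r in enumerate(rels, start=1))

def sample_measure(rels, sample_theta, n=10000):
    """Sample a 'user' (a patience value) and compute the measure for it; repeat."""
    return [rbp(rels, sample_theta()) for _ in range(n)]

sample_theta = lambda: random.betavariate(8, 2)     # stand-in for P(theta | L)

# Distribution of RBP for one system
samples = sample_measure([1, 0, 1, 0, 0, 1, 0, 0, 0, 1], sample_theta)
print(sum(samples) / len(samples))                  # mean of P(RBP)

# Marginal distribution analysis for two systems: estimate P(M1 > M2)
s1 = [1] + [0] * 9      # S1 = [R N N N N N N N N N]
s2 = [0] + [1] * 9      # S2 = [N R R R R R R R R R]
wins = 0
for _ in range(10000):
    theta = sample_theta()                          # the same sampled user sees both systems
    wins += rbp(s1, theta) > rbp(s2, theta)
print(wins / 10000)                                 # estimate of P(RBP_S1 > RBP_S2)
```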

Page 65: Advances on the Development of Evaluation Measures

Marginal Distribution Analysis

• S1=[R N N N N N N N N N]

• S2=[N R R R R R R R R R]

Page 66: Advances on the Development of Evaluation Measures

Distribution of RBP

Page 67: Advances on the Development of Evaluation Measures

Distribution of ERR

Page 68: Advances on the Development of Evaluation Measures

Marginal Distribution Analysis

• Given two systems, over all choices of θ

– What is P(M1 > M2)?

– What is P((M1 - M2)>t)?

Page 69: Advances on the Development of Evaluation Measures

Marginal Distribution Analysis

Page 70: Advances on the Development of Evaluation Measures

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Page 71: Advances on the Development of Evaluation Measures

Why sessions?

• Current evaluation framework
  – Assesses the effectiveness of systems over one-shot queries
• Users reformulate their initial query
• Still fine if …
  – optimizing systems for one-shot queries led to optimal performance over an entire session

Page 72: Advances on the Development of Evaluation Measures

When was the DuPont Science Essay Contest created?

Initial Query : DuPont Science Essay Contest

Reformulation : When was the DSEC created?

• e.g. retrieval systems should accumulate information along a session

Why sessions?

Page 73: Advances on the Development of Evaluation Measures

Paris Luxurious Hotels
Paris Hilton
J Lo Paris

Page 74: Advances on the Development of Evaluation Measures

Extend the evaluation framework

From single-query evaluation

to multi-query session evaluation

Page 75: Advances on the Development of Evaluation Measures

Construct appropriate test collections

Rethink evaluation measures

Page 76: Advances on the Development of Evaluation Measures

Basic test collection

• A set of information needs, e.g.:
  "A friend from Kenya is visiting you and you'd like to surprise him by cooking a traditional Swahili dish. You would like to search online to decide which dish you will cook at home."
• A static sequence of m queries for each need (initial query, 1st reformulation, 2nd reformulation, ..., (m-1)th reformulation), e.g.:
  – kenya cooking traditional
  – kenya cooking traditional swahili
  – www.allrecipes.com
  – kenya swahili traditional food recipes

Page 77: Advances on the Development of Evaluation Measures

Basic Test Collection

Factual/Amorphous, Known-item search

Intellectual/Amorphous, Explanatory search

Factual/Amorphous, Known-item search

Page 78: Advances on the Development of Evaluation Measures

Experiment

[Figure: ranked results 1-10 retrieved for each query in the session: "kenya cooking traditional swahili", "kenya swahili traditional food recipes", "kenya cooking traditional"]

Page 79: Advances on the Development of Evaluation Measures

Experiment

[Figure: the same per-query ranked lists, shown again]

Page 80: Advances on the Development of Evaluation Measures

Construct appropriate test collections

Rethink evaluation measures

Page 81: Advances on the Development of Evaluation Measures

What is a good system?

Page 82: Advances on the Development of Evaluation Measures

How can we measure “goodness”?

Page 83: Advances on the Development of Evaluation Measures

Measuring “goodness”

The user steps down a ranked list of documents and observes each one of them until a decision point, and then either

a) abandons the search, or
b) reformulates

While stepping down or sideways, the user accumulates utility

Page 84: Advances on the Development of Evaluation Measures

What are the challenges?

Page 85: Advances on the Development of Evaluation Measures

Evaluation over a single ranked list

[Figure: the ranked lists (1-10) retrieved for each of the session queries]

Page 86: Advances on the Development of Evaluation Measures
Page 87: Advances on the Development of Evaluation Measures

Session DCG [Järvelin et al ECIR 2008]

Queries: "kenya cooking traditional swahili", "kenya cooking traditional"

DCG of each query's ranked list:

  DCG(RL_q) = \sum_{r=1}^{k} \frac{2^{rel(r)} - 1}{\log_b(r + b - 1)}

Each list is then discounted by its position in the session:

  sDCG = \frac{1}{\log_c(1 + c - 1)} DCG(RL_1) + \frac{1}{\log_c(2 + c - 1)} DCG(RL_2) + ...
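A sketch of session DCG (Python). The base values b = 2 and c = 4 and the toy grades are illustrative choices, not values prescribed by the slide.

```python
import math

def dcg(grades, b=2):
    """Per-query DCG: sum_r (2^rel(r) - 1) / log_b(r + b - 1)."""
    return sum((2 ** g - 1) / math.log(r + b - 1, b)
               for r, g in enumerate(grades, start=1))

def sdcg(session, b=2, c=4):
    """sDCG: discount each query's DCG by its position q in the session,
    using 1 / log_c(q + c - 1)."""
    return sum(dcg(grades, b) / math.log(q + c - 1, c)
               for q, grades in enumerate(session, start=1))

# A toy session: graded judgments for the top results of each reformulation
session = [[2, 1, 0, 0], [0, 2, 1, 0], [1, 0, 0, 0]]
print(sdcg(session))
```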

Page 88: Advances on the Development of Evaluation Measures

Session Metrics

• Session DCG [Järvelin et al ECIR 2008]

The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]

Page 89: Advances on the Development of Evaluation Measures

Model-based measures

Probabilistic space of users following different paths

• Ω is the space of all paths
• P(ω) is the probability of a user following a path ω in Ω
• U(ω) is the utility of path ω in Ω

[Yang and Lad ICTIR 2009]

  \sum_{ω \in Ω} P(ω) \, U(ω)

Page 90: Advances on the Development of Evaluation Measures

Expected Global Utility [Yang and Lad ICTIR 2009]

1. User steps down ranked results one-by-one

2. Stops browsing documents based on a stochastic process that defines a stopping probability distribution over ranks and reformulates

3. Gains something from relevant documents, accumulating utility

Page 91: Advances on the Development of Evaluation Measures

Expected Global Utility [Yang and Lad ICTIR 2009]

• The probability of a user following a path ω:

P(ω) = P(r1, r2, ..., rK)

ri is the stopping and reformulation point in list i

– Assumption: stopping positions in each list are independent

P(r1, r2, ..., rK) = P(r1)P(r2)...P(rK)

– Use geometric distribution (RBP) to model the stopping and reformulation behaviour

  P(r_i = r) = (1 - θ) \, θ^{r-1}
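A Monte Carlo sketch of the expected-utility-over-paths idea (Python). The stopping rank in each list is drawn from the geometric distribution above; the utility of a path here is simply the number of relevant documents seen, which is a simplification of the utility function Yang and Lad actually use.

```python
import random

def sample_stop_rank(theta, max_rank):
    """Geometric stopping: P(r) = (1 - theta) * theta**(r - 1), truncated at max_rank."""
    r = 1
    while r < max_rank and random.random() < theta:
        r += 1
    return r

def expected_global_utility(session, theta=0.7, n_samples=10000):
    """Average, over sampled paths, of the utility accumulated along the path."""
    total = 0.0
    for _ in range(n_samples):
        path_utility = 0
        for rels in session:                       # one ranked list per reformulation
            stop = sample_stop_rank(theta, len(rels))
            path_utility += sum(rels[:stop])       # relevant documents seen before moving on
        total += path_utility
    return total / n_samples

# The Q1/Q2/Q3 example from the next slide: Q1 all N, Q2 relevant at ranks 1-5, Q3 all R
session = [[0] * 10, [1] * 5 + [0] * 5, [1] * 10]
print(expected_global_utility(session, theta=0.7))
```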

Page 92: Advances on the Development of Evaluation Measures

Expected Global Utility

Example: three ranked lists in a session (N = non-relevant, R = relevant)

Rank   Q1   Q2   Q3
  1    N    R    R
  2    N    R    R
  3    N    R    R
  4    N    R    R
  5    N    R    R
  6    N    N    R
  7    N    N    R
  8    N    N    R
  9    N    N    R
 10    N    N    R

The stopping rank in each list is modeled with a geometric distribution with parameter θ.

Page 93: Advances on the Development of Evaluation Measures

Session Metrics

• Session DCG [Järvelin et al ECIR 2008]

The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]

• Expected global utility [Yang and Lad ICTIR 2009]

The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]

Page 94: Advances on the Development of Evaluation Measures

Model-based measures

Probabilistic space of users following different paths

• Ω is the space of all paths
• P(ω) is the probability of a user following a path ω in Ω
• M_ω is a measure computed over a path ω

[Kanoulas et al. SIGIR 2011]

  esM = \sum_{ω \in Ω} P(ω) \, M_ω

Page 95: Advances on the Development of Evaluation Measures

Probability of a path

P(path) = Probability of abandoning at reformulation 2 (1)  ×  Probability of reformulating at rank 3 (2)

[Figure: the Q1/Q2/Q3 ranked lists from the previous slide, with this path highlighted]

Page 96: Advances on the Development of Evaluation Measures

[Figure: the Q1/Q2/Q3 ranked lists]

(1) Probability of abandoning the session at reformulation i: geometric with parameter p_reform

Page 97: Advances on the Development of Evaluation Measures

[Figure: the Q1/Q2/Q3 ranked lists]

(1) Probability of abandoning the session at reformulation i: truncated geometric with parameter p_reform

Page 98: Advances on the Development of Evaluation Measures

[Figure: the Q1/Q2/Q3 ranked lists]

(1) Probability of abandoning the session at reformulation i: truncated geometric with parameter p_reform
(2) Probability of reformulating at rank j: geometric with parameter p_down

Page 99: Advances on the Development of Evaluation Measures

Session Metrics

• Session DCG [Järvelin et al ECIR 2008]

The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]

• Expected global utility [Yang and Lad ICTIR 2009]

The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]

• Expected session measures [Kanoulas et al. SIGIR 2011]

The user steps down a ranked list of documents until a decision point and either abandons the query or reformulates [Stochastic; allows early abandonment]

Page 100: Advances on the Development of Evaluation Measures

Outline

• Intro to evaluation

– Different approaches to evaluation

– Traditional evaluation measures

• User model based evaluation measures

• Session Evaluation

• Novelty and Diversity


Page 101: Advances on the Development of Evaluation Measures

Novelty

• The redundancy problem:

– the first relevant document contains some useful information

– every document with the same information after that is worth less to the user

• but worth the same to traditional evaluation measures

• Novelty retrieval attempts to ensure that ranked results do not have much redundancy

Page 102: Advances on the Development of Evaluation Measures

Example

• Query: “oil-producing nations”
  – members of OPEC
  – North Atlantic nations
  – South American nations
• 10 relevant articles about OPEC are probably not as useful as one relevant article about each group
  – And one relevant article about all oil-producing nations might be even better

Page 103: Advances on the Development of Evaluation Measures

How to Evaluate?

• One approach:

– List subtopics, aspects, or facets of the topic

– Judge each document relevant or not to each possible subtopic

• For oil-producing nations, subtopics could be names of nations

– Saudi Arabia, Russia, Canada, …

Page 104: Advances on the Development of Evaluation Measures

Subtopic Relevance Example

Page 105: Advances on the Development of Evaluation Measures

Evaluation Measures

• Subtopic recall and precision (Zhai et al., 2003)
  – Subtopic recall at rank k:
    • Count the unique subtopics covered in the top k documents
    • Divide by the total number of known unique subtopics
  – Subtopic precision at recall r:
    • Find the least k at which subtopic recall r is achieved
    • Find the least k at which subtopic recall r could possibly be achieved (by a perfect system)
    • Divide the latter by the former
  – Models a user that wants all subtopics
    • and doesn't care about redundancy as long as they are seeing new information (see the sketch below)
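A sketch of both measures (Python). Here each document is represented by the set of subtopics it is relevant to; the example documents and the "perfect" ranking are hypothetical.

```python
def subtopic_recall(docs, k, n_subtopics):
    """Unique subtopics covered in the top k documents / total known subtopics."""
    covered = set().union(*docs[:k]) if docs[:k] else set()
    return len(covered) / n_subtopics

def min_rank_for_recall(docs, r, n_subtopics):
    """Least k at which subtopic recall r is achieved (None if it never is)."""
    for k in range(1, len(docs) + 1):
        if subtopic_recall(docs, k, n_subtopics) >= r:
            return k
    return None

def subtopic_precision(docs, ideal_docs, r, n_subtopics):
    """Least k achievable by a perfect system divided by the system's least k."""
    k_sys = min_rank_for_recall(docs, r, n_subtopics)
    k_opt = min_rank_for_recall(ideal_docs, r, n_subtopics)
    return k_opt / k_sys if k_sys and k_opt else 0.0

# Hypothetical example with 3 subtopics {a, b, c}
system = [{"a"}, {"a"}, {"b"}, set(), {"c"}]
ideal = [{"a"}, {"b"}, {"c"}, set(), set()]
print(subtopic_recall(system, 3, 3))                 # 2/3
print(subtopic_precision(system, ideal, 1.0, 3))     # 3/5
```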

Page 106: Advances on the Development of Evaluation Measures

Subtopic Relevance Evaluation

Copyright © Ben Carterette

Page 107: Advances on the Development of Evaluation Measures

Diversity

• Short keyword queries are inherently ambiguous

– An automatic system can never know the user’s intent

• Diversification attempts to retrieve results that may be relevant to a space of possible intents

Page 108: Advances on the Development of Evaluation Measures

Evaluation Measures

• Subtopic recall and precision

– This time with judgments to “intents” rather than subtopics

• Measures that know about intents:

– “Intent-aware” family of measures (Agrawal et al.)

– D, D♯ measures (Sakai et al.)

– α-nDCG (Clarke et al.)

– ERR-IA (Chapelle et al.)

Page 109: Advances on the Development of Evaluation Measures

Intent-Aware Measures

• Assume there is a probability distribution P(i | Q) over intents for a query Q

– Probability that a randomly-sampled user means intent i when submitting query Q

• The intent-aware version of a measure is its weighted average over this distribution

Page 110: Advances on the Development of Evaluation Measures

P@10-IA = 0.35·0.3 + 0.35·0.3 + 0.2·0.2 + 0.08·0.1 + 0.02·0.1 = 0.26
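A sketch of the intent-aware computation (Python), using the intent probabilities and per-intent P@10 values shown above, which come from the slide's figure.

```python
def intent_aware(per_intent_values, intent_probs):
    """Weighted average of a measure over the intent distribution P(i | Q)."""
    return sum(p * m for p, m in zip(intent_probs, per_intent_values))

intent_probs = [0.35, 0.35, 0.20, 0.08, 0.02]    # P(i | Q)
p10_per_intent = [0.3, 0.3, 0.2, 0.1, 0.1]       # P@10 computed against each intent
print(intent_aware(p10_per_intent, intent_probs))   # 0.26
```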

Page 111: Advances on the Development of Evaluation Measures

D-measure

• Take the idea of intent-awareness and apply it to computing document gain

– The gain for a document is the (weighted) average of its gains for subtopics it is relevant to

• D-nDCG is nDCG computed using intent-aware gains

Page 112: Advances on the Development of Evaluation Measures

D-DCG = 0.35/log 2 + 0.35/log 3 + …

Page 113: Advances on the Development of Evaluation Measures

α-nDCG

• α-nDCG is a generalization of nDCG that accounts for both novelty and diversity

• α is a geometric penalization for redundancy

– Redefine the gain of a document:

• +1 for each subtopic it is relevant to

• multiplied by (1-α) for each higher-ranked document in which that subtopic has already appeared

• Discount is the same as usual

Page 114: Advances on the Development of Evaluation Measures

[Figure: an example ranking showing the per-subtopic gain contributions down the list: +1, +1, +1, +1, +(1-α), +1, +(1-α), +(1-α), +(1-α)², +(1-α)²]
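A sketch of the α-nDCG gain computation (Python); documents are represented by the sets of subtopics they are relevant to, and the example ranking is hypothetical. Dividing by the α-DCG of an (approximately) ideal ranking gives α-nDCG.

```python
import math

def alpha_dcg(docs, alpha=0.5):
    """Gain of a document: +1 per subtopic it covers, times (1 - alpha) for each
    higher-ranked document that already covered that subtopic; the rank
    discount is the usual 1 / log2(r + 1)."""
    seen = {}                                   # subtopic -> times already covered
    total = 0.0
    for r, subtopics in enumerate(docs, start=1):
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
        total += gain / math.log2(r + 1)
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
    return total

print(alpha_dcg([{"a"}, {"b"}, {"a", "b"}, {"a"}], alpha=0.5))
```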

Page 115: Advances on the Development of Evaluation Measures

ERR-IA

• Intent-aware version of ERR

• But it has appealing properties other IA measures do not have:

– ranges between 0 and 1

– submodularity: diminishing returns for relevance to a given subtopic -> built-in redundancy penalization

• Also has appealing properties over α-nDCG:

– Easily handles graded subtopic judgments

– Easily handles intent distributions

Page 116: Advances on the Development of Evaluation Measures

Granularity of Judging

• What exactly is a “subtopic”?

– Perhaps any piece of information a user may be interested in finding?

• At what granularity should subtopics be defined?

– For example:

• “cardinals” has many possible meanings

• “cardinals baseball team” is still very broad

• “cardinals baseball team schedule” covers 6 months

• “cardinals baseball team schedule august” covers ~25 games

• “cardinals baseball team schedule august 12th”

Page 117: Advances on the Development of Evaluation Measures

Preference Judgments for Novelty

• What about evaluating novelty with no subtopic judgments?

• Preference judgments:

– Is document A more relevant than document B?

• Conditional preference judgments:

– Is document A better than document B given that I’ve just seen document C?

– Assumption: preference is based on novelty over C

• Is it true? Come to our presentation on Wednesday…

Page 118: Advances on the Development of Evaluation Measures
Page 119: Advances on the Development of Evaluation Measures

Conclusions

• Strong interest in using evaluation measures to model user behavior and satisfaction
  – Driven by the availability of user logs, increased computational power, and good abstract models
  – DCG, RBP, ERR, EBU, session measures, and diversity measures all model users in different ways

• Cranfield-style evaluation is still important!

• But there is still much to understand about users and how they derive satisfaction

Page 120: Advances on the Development of Evaluation Measures

Conclusions

• Ongoing and future work:
  – Models with more degrees of freedom
  – Direct simulation of users from the start of a session to the finish
  – Application to other domains

• Thank you!
  – Slides will be available online
  – http://ir.cis.udel.edu/SIGIR12tutorial