
Evaluation of IR Systems

Adapted from Lectures by

Prabhakar Raghavan (Google) and

Christopher Manning (Stanford)

Prasad L10Evaluation

This lecture

Results summaries: making our good results presentable and usable to a user

How do we know if our results are any good? Evaluating a search engine

Benchmarks

Precision and recall


Result Summaries

Having ranked the documents matching a query, we wish to present a results list.

Most commonly, a list of document titles plus a short summary, aka “10 blue links”.


Summaries

The title is typically automatically extracted from document metadata. What about the summaries?

This description is crucial. User can identify good/relevant hits based on description.

Two basic kinds:

A static summary of a document is always the same, regardless of the query that hit the doc.

A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query.


Static summaries

In typical systems, the static summary is a subset of the document.

Simplest heuristic: the first 50 (or so; this can be varied) words of the document. The summary is cached at indexing time.

More sophisticated: extract from each document a set of “key” sentences. Simple NLP heuristics score each sentence, and the summary is made up of the top-scoring sentences.

Most sophisticated: NLP used to synthesize a summary. Seldom used in IR (cf. text summarization work).

Dynamic summaries

Present one or more “windows” within the document that contain several of the query terms

“KWIC” snippets: Keyword in Context presentation

Generated in conjunction with scoring

If the query is found as a phrase, show all or some occurrences of the phrase in the doc

If not, show doc windows that contain multiple query terms

The summary itself gives the entire content of the window: all terms, not only the query terms.

Generating dynamic summaries

If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.

If we cache the documents at index time, then we can find windows in the cached copy, cueing from hits found in the positional index.

E.g., the positional index says “the query is a phrase in position 4378”, so we go to this position in the cached document and stream out the content.

Most often, cache only a fixed-size prefix of the doc.

Note: the cached copy can be outdated
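The window lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kwic_window`, the whitespace tokenization, and the sample document are all hypothetical, and a real system would read hit positions from its positional index rather than hard-code them.

```python
def kwic_window(cached_tokens, hit_position, hit_length=1, context=3):
    """Return the text window around a hit, including all terms in the window."""
    start = max(0, hit_position - context)
    end = min(len(cached_tokens), hit_position + hit_length + context)
    return " ".join(cached_tokens[start:end])


# Hypothetical cached document, tokenized on whitespace.
cached_doc = ("studies suggest that drinking red wine may reduce the risk "
              "of heart attacks more than white wine does").split()

# Pretend the positional index reported the phrase "red wine" at token 4.
snippet = kwic_window(cached_doc, hit_position=4, hit_length=2, context=3)
print(snippet)
```

The snippet contains every term in the window, not only the query terms, which matches the KWIC presentation described earlier.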



Dynamic summaries

Producing good dynamic summaries is a tricky optimization problem.

The real estate for the summary is normally small and fixed.

Want snippets to be long enough to be useful.

Want linguistically well-formed snippets.

Want snippets maximally informative about the doc.

But users really like snippets, even if they complicate IR system design.


Alternative results presentations?

An active area of HCI research

An alternative: http://www.searchme.com/ copies the idea of Apple’s Cover Flow for search results

Evaluating search engines


Measures for a search engine

How fast does it index? E.g., number of bytes per hour

How fast does it search? E.g., latency as a function of queries per second

What is the cost per query? In dollars

All of the preceding criteria are measurable: we can quantify speed / size / money.

However, the key measure for a search engine is user happiness.


Data Retrieval vs Information Retrieval

DR Performance Evaluation (after establishing correctness): response time, index space, …

IR Performance Evaluation: How relevant is the answer set? How happy are the users? (Required to establish “functional correctness”, e.g., through benchmarks)



Measures for a search engine

What is user happiness? Factors include:

Speed of response

Size of index

Uncluttered UI

Most important: relevance

(Actually, maybe even more important: it’s free)

None of these is sufficient: blindingly fast, but useless answers won’t make a user happy.

How can we quantify user happiness?



Measuring user happiness: Who is the user?

Web search engine: searcher. Success: Searcher finds what was looked for. Measure: rate of return to this search engine

Web search engine: advertiser. Success: Searcher clicks on ad. Measure: clickthrough rate

Ecommerce: buyer. Success: Buyer buys something. Measures: time to purchase, fraction of “conversions” of searchers to buyers

Ecommerce: seller. Success: Seller sells something. Measure: profit per item sold

Enterprise: CEO. Success: Employees are more productive (because of effective search). Measure: profit of the company



Happiness: elusive to measure

Most common proxy: relevance of search results

Standard Methodology in IR: Relevance measurement requires 3 elements:

1. A benchmark document collection

2. A benchmark suite of queries

3. An assessment of relevance for each query-document pair

Some work on binary relevance, others use multi-valued relevance (or partial orders)


Evaluating an IR system

Note: the information need is translated into a query

Relevance is assessed relative to the information need, not the query

E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine.

Query: wine red white heart attack effective

You evaluate whether the doc addresses the information need, not whether it has these words.

Evaluating an IR system

Information need i : “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”

Query q: [red wine white wine heart attack]

Consider document d′: At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.

d′ is an excellent match for query q . . . d′ is not relevant to the information need i .


Difficulties with gauging Relevancy

Relevancy, from a human standpoint, is:

Subjective: Depends upon a specific user’s judgment.

Situational: Relates to user’s current needs.

Cognitive: Depends on human perception and behavior.

Dynamic: Changes over time.


Standard relevance benchmarks

TREC - National Institute of Standards and Technology (NIST) has run a large IR test bed for many years

Reuters and other benchmark doc collections used

“Retrieval tasks” specified sometimes as queries

Human experts mark, for each query and for each doc, Relevant or Nonrelevant, or at least for the subset of docs that some system returned for that query

Unranked retrieval evaluation: Precision and Recall

Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)

Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

Precision P = tp/(tp + fp)

Recall R = tp/(tp + fn)

                 Relevant               Nonrelevant
Retrieved        tp (true positive)     fp (false positive)
Not Retrieved    fn (false negative)    tn (true negative)
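As a minimal sketch of these two definitions, the following Python computes precision and recall from a retrieved set and a relevant set. The function name `precision_recall` and the document ids are made up for illustration; the sketch returns NaN where a measure is undefined, which is one possible convention rather than a standard.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else float("nan")
    recall = tp / (tp + fn) if relevant else float("nan")
    return precision, recall


# Hypothetical document ids.
retrieved_ids = [1, 2, 3, 4]
relevant_ids = [2, 4, 5, 6, 7]
p, r = precision_recall(retrieved_ids, relevant_ids)
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents are retrieved.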



Precision and Recall



Precision and Recall in Practice

Precision

The ability to retrieve top-ranked documents that are mostly relevant.

The fraction of the retrieved documents that are relevant.

Recall

The ability of the search to find all of the relevant items in the corpus.

The fraction of the relevant documents that are retrieved.


Introduction to Information Retrieval

Accuracy

Why do we use complex measures like precision, recall, etc.? Why not something simple like accuracy?

Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.

In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn).

Why is accuracy not a useful measure for web information retrieval?

Why not just use accuracy?

How to build a 99.9999% accurate search engine on a low budget…

Search for:

0 matching results found.

People doing information retrieval want to find something and have a certain tolerance for junk.
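The arithmetic behind the joke can be checked with a short Python sketch. The collection size and the number of relevant documents below are assumptions chosen only to reproduce the 99.9999% figure; they are not taken from any real collection.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Assumed numbers: one relevant document in a collection of a million.
collection_size = 1_000_000
relevant_docs = 1

# An engine that always returns nothing retrieves no documents at all.
tp, fp = 0, 0
fn = relevant_docs
tn = collection_size - relevant_docs

acc = accuracy(tp, fp, fn, tn)
rec = tp / (tp + fn)
print(f"accuracy = {acc:.4%}, recall = {rec:.0%}")
```

Because nearly every document is nonrelevant to any given query, the empty result list is almost always "correct", yet its recall is zero.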



Precision/Recall

You can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

In a good system, precision decreases as either the number of docs retrieved or recall increases

This is not a theorem, but a result with strong empirical confirmation


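Trade-offs

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). The ideal is the top-right corner, where both precision and recall equal 1. High precision with low recall: returns relevant documents but misses many useful ones too. High recall with low precision: returns most relevant documents but includes a lot of junk.]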


Difficulties in using precision/recall

Should average over large document collection/query ensembles

Need human relevance assessments

People aren’t reliable assessors

Assessments have to be binary

Nuanced assessments?

Heavily skewed by collection/authorship

Results may not translate from one domain to another


A combined measure: F

Combined measure that assesses precision/recall tradeoff is F measure (harmonic mean):

$$F = \frac{2PR}{P + R} = \frac{2}{\frac{1}{R} + \frac{1}{P}}$$

Harmonic mean is a conservative average

See CJ van Rijsbergen, Information Retrieval


Aka E Measure (parameterized F Measure)

Variants of F measure that allow weighting emphasis on precision over recall:

$$E = \frac{(1 + \beta^2) P R}{\beta^2 P + R} = \frac{1 + \beta^2}{\frac{\beta^2}{R} + \frac{1}{P}}$$

Value of $\beta$ controls trade-off:

$\beta = 1$: Equally weight precision and recall (E = F).

$\beta > 1$: Weight recall more.

$\beta < 1$: Weight precision more.
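A small Python sketch of the parameterized measure may help make the role of the weight concrete. The function name `f_beta` and the sample precision/recall values are assumptions for illustration only.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more, beta < 1 weights precision more,
    and beta = 1 gives the balanced F1 measure.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical system with high precision and low recall.
p, r = 0.9, 0.3
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)      # recall-heavy: pulled toward 0.3
f_half = f_beta(p, r, beta=0.5)  # precision-heavy: pulled toward 0.9
print(round(f1, 3), round(f2, 3), round(f_half, 3))
```

With these inputs the recall-weighted score is the lowest of the three and the precision-weighted score is the highest, as the description of the weight suggests.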


F1 and other averages

[Figure: “Combined Measures”. Minimum, maximum, arithmetic mean, geometric mean and harmonic mean of precision and recall (y-axis, 0 to 100), plotted against precision (x-axis, 0 to 100) with recall fixed at 70%.]

F: Example

                 relevant    not relevant        total
retrieved              20              40           60
not retrieved          60       1,000,000    1,000,060
total                  80       1,000,040    1,000,120

P = 20/(20 + 40) = 1/3

R = 20/(20 + 60) = 1/4
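To carry the example through to F1, here is a short Python sketch using exact fractions. It assumes the balanced harmonic-mean form F1 = 2PR/(P + R); the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the contingency table in the example.
tp, fp, fn = 20, 40, 60

precision = Fraction(tp, tp + fp)   # 20/60
recall = Fraction(tp, tp + fn)      # 20/80

# Balanced F measure: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

The harmonic mean (2/7, about 0.29) lies closer to the smaller of the two values than the arithmetic mean (7/24) does, which is why it is called a conservative average.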


Exercise

Compute precision, recall and F1 for this result set:

                 relevant     not relevant
retrieved              18                2
not retrieved          82    1,000,000,000

[Figure: “Recall vs Precision and F1”. Precision and F1 (y-axis, 0 to 1.2) plotted against recall (x-axis, 0 to 1.2).]

Breakeven Point

Breakeven point is the point where precision equals recall.

Alternative single measure of IR effectiveness.

How do you compute it?


Evaluating ranked results

Precision/recall/F are measures for unranked sets.

We can easily turn set measures into measures of ranked lists.

Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4 etc. results

Doing this for precision and recall gives you a precision-recall curve.


Computing Recall/Precision Points: An Example

Let total # of relevant docs = 6. Check each new recall point:

 n    doc #    relevant    recall / precision at this point
 1    588      x           R=1/6=0.167; P=1/1=1
 2    589      x           R=2/6=0.333; P=2/2=1
 3    576
 4    590      x           R=3/6=0.5; P=3/4=0.75
 5    986
 6    592      x           R=4/6=0.667; P=4/6=0.667
 7    984
 8    988
 9    578
10    985
11    103
12    591
13    772      x           R=5/6=0.833; P=5/13=0.38
14    990

Missing one relevant document. Never reach 100% recall.
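The same recall/precision points can be produced mechanically by walking down the ranking. The following Python is a sketch under the assumption that relevance is given as one boolean per rank; the function name `recall_precision_points` is hypothetical.

```python
def recall_precision_points(relevance_flags, total_relevant):
    """Return (rank, recall, precision) at each rank where a relevant doc appears."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


# Ranks of the relevant documents in the 14-document example above.
relevant_ranks = {1, 2, 4, 6, 13}
flags = [rank in relevant_ranks for rank in range(1, 15)]

points = recall_precision_points(flags, total_relevant=6)
for rank, rec, prec in points:
    print(f"rank {rank:2d}: R={rec:.3f} P={prec:.3f}")
```

Because only five of the six relevant documents appear in the ranking, the last point stops at recall 5/6.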

A precision-recall curve

[Figure: precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0).]


Averaging over queries

A precision-recall graph for one query isn’t a very sensible thing to look at.

You need to average performance over a whole bunch of queries.

But there’s a technical issue:

Precision-recall calculations place some points on the graph

How do you determine a value (interpolate) between the points?

Sec. 8.4

Interpolated precision

Idea: If locally precision increases with increasing recall, then you should get to count that…

So you take the max of the precisions to the right of the value

Sec. 8.4

Interpolating a Recall/Precision Curve

Interpolate a precision value for each standard recall level: $r_j \in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$

$r_0 = 0.0$, $r_1 = 0.1$, …, $r_{10} = 1.0$

The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level at or above the j-th level:

$$P(r_j) = \max_{r \ge r_j} P(r)$$
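A minimal Python sketch of this interpolation rule follows, applied to the measured points from the 14-document example. Two things here are assumptions rather than part of the slides: the helper name `interpolated_precision`, and the convention that a recall level the ranking never reaches gets interpolated precision 0.

```python
from fractions import Fraction


def interpolated_precision(points, level):
    """Max precision over all measured points with recall >= level."""
    candidates = [prec for rec, prec in points if rec >= level]
    # Assumption: levels beyond the highest recall reached count as 0.
    return max(candidates, default=Fraction(0))


# (recall, precision) points measured in the worked example.
points = [
    (Fraction(1, 6), Fraction(1)),
    (Fraction(2, 6), Fraction(1)),
    (Fraction(3, 6), Fraction(3, 4)),
    (Fraction(4, 6), Fraction(4, 6)),
    (Fraction(5, 6), Fraction(5, 13)),
]

levels = [Fraction(j, 10) for j in range(11)]   # r_0 .. r_10
interp = [interpolated_precision(points, lv) for lv in levels]
eleven_point_avg = sum(interp) / len(levels)

for lv, prec in zip(levels, interp):
    print(f"recall {float(lv):.1f}: interpolated precision {float(prec):.3f}")
print(f"11-point average: {float(eleven_point_avg):.3f}")
```

Exact fractions are used so that comparisons such as 3/6 >= 0.5 are not affected by floating-point rounding.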

Interpolated precision-recall curve

[Figure: interpolated precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0).]

Evaluation Metrics (cont’d)

Graphs are good, but people want summary measures!

Precision at fixed retrieval level

Precision-at-k: Precision of top k results

Perhaps appropriate for web search: all people want are good matches on the first one or two results pages

But: averages badly and has an arbitrary parameter of k

11-point interpolated average precision

The standard measure in the early TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths of the documents, using interpolation (the value for 0 is always interpolated!), and average them

Evaluates performance at all recall levels


Typical (good) 11 point precisions

[Figure: SabIR/Cornell 8A1 11pt precision from TREC 8 (1999). Precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1).]


11-point interpolated average precision

Recall    Interpolated Precision
0.0       1.00
0.1       0.67
0.2       0.63
0.3       0.55
0.4       0.45
0.5       0.41
0.6       0.36
0.7       0.29
0.8       0.13
0.9       0.10
1.0       0.08

11-point average: ≈ 0.425

How can precision at 0.0 be > 0?

11 point precisions

[Figure: precision (y-axis, 0 to 120) plotted against recall (x-axis, 0 to 120).]


Receiver Operating Characteristics (ROC) Curve

True positive rate = tp/(tp+fn) = recall = sensitivity

False positive rate = fp/(tn+fp). Related to precision: fpr = 0 <-> p = 1

Why is the blue line “worthless”?



Variance of measures like precision/recall

For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1).

Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query.

That is, there are easy information needs and hard ones.


Mean average precision (MAP)

Average precision (AP) for a query

Average of the precision value for each (of the k top) relevant document retrieved

This approach weights early appearance of a relevant document over later appearance

MAP for a query collection is the mean (arithmetic average) of AP for each query

Macro-averaging: each query counts equally


Average Precision

Mean Average Precision (MAP)

summarize rankings from multiple queries by taking mean (averaging) of average precision

most commonly used measure in research papers

assumes user is interested in finding many relevant documents for each query

requires many binary relevance judgments in text collection

Summarize a Ranking: MAP

Given that n docs are retrieved

Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs

E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.

If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero

Compute the average over all the relevant documents

Average precision = (p(1)+…+p(k))/k
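The procedure above can be sketched in Python as follows. The function names and the second query are hypothetical; the first query reuses the 14-document ranking with 6 relevant documents from the earlier recall/precision example, where one relevant document is never retrieved and so contributes zero.

```python
def average_precision(relevance_flags, total_relevant):
    """Sum precision at each relevant hit, divided by ALL relevant docs.

    Relevant documents that are never retrieved contribute 0 to the sum.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(queries):
    """Arithmetic mean of AP over (relevance_flags, total_relevant) pairs."""
    return sum(average_precision(f, n) for f, n in queries) / len(queries)


# Query A: the earlier ranking, relevant docs at ranks 1, 2, 4, 6, 13 of 14.
query_a = ([rank in {1, 2, 4, 6, 13} for rank in range(1, 15)], 6)
# Query B (made up): first relevant doc at rank 2, so p(1) = 1/2.
query_b = ([False, True, True], 2)

ap_a = average_precision(*query_a)
ap_b = average_precision(*query_b)
map_score = mean_average_precision([query_a, query_b])
print(round(ap_a, 4), round(ap_b, 4), round(map_score, 4))
```

Each query counts equally in the mean, regardless of how many relevant documents it has.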

(cont’d)

This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document

Mean Average Precision (MAP)

MAP = arithmetic mean of average precision over a set of topics

gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)

Discounted Cumulative Gain

Popular measure for evaluating web search and related tasks.

Two assumptions:

Highly relevant documents are more useful than marginally relevant documents.

The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined.

Discounted Cumulative Gain

Uses graded relevance as a measure of usefulness, or gain, from examining a document

Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks

Typical discount is 1/log(rank)

With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3

Summarize a Ranking: DCG

What if relevance judgments are in a scale of [1,r]? r>2

Cumulative Gain (CG) at rank n

Let the ratings of the n documents be $r_1, r_2, \dots, r_n$ (in ranked order)

$$CG = r_1 + r_2 + \dots + r_n$$

Discounted Cumulative Gain (DCG) at rank n

$$DCG = r_1 + \frac{r_2}{\log_2 2} + \frac{r_3}{\log_2 3} + \dots + \frac{r_n}{\log_2 n}$$

We may use any base for the logarithm, e.g., base = b

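Discounted Cumulative Gain

DCG is the total gain accumulated at a particular rank p:

$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$

Alternative formulation:

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log(1 + i)}$$

Used by some web search companies

Emphasis on retrieving highly relevant documents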

DCG Example

10 ranked documents judged on 0-3 relevance scale:

3, 2, 3, 0, 0, 1, 2, 2, 3, 0

Discounted gain:

3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0

DCG:

3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
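The running totals in this example can be reproduced with a few lines of Python. This sketch assumes the base-2 formulation in which rank 1 is not discounted; the function name `dcg_at_each_rank` is illustrative.

```python
import math


def dcg_at_each_rank(gains):
    """Return the running DCG after each rank (rank 1 is not discounted)."""
    total = 0.0
    running = []
    for rank, gain in enumerate(gains, start=1):
        discount = 1.0 if rank == 1 else math.log2(rank)
        total += gain / discount
        running.append(total)
    return running


# Graded relevance (0-3) of the 10 ranked documents in the example.
gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
dcg = dcg_at_each_rank(gains)
rounded = [round(value, 2) for value in dcg]
print(rounded)
```

Rounded to two decimals, the running totals agree with the DCG row of the example.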


Summarize a Ranking: NDCG

Normalized Discounted Cumulative Gain (NDCG) at rank n

Normalize DCG at rank n by the DCG value at rank n of the ideal ranking

The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc.

NDCG is now quite popular in evaluating Web search

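NDCG - Example

4 documents: d1, d2, d3, d4

       Ground Truth             Ranking Function1        Ranking Function2
 i     Document Order   r_i     Document Order   r_i     Document Order   r_i
 1     d4               2       d3               2       d3               2
 2     d3               2       d4               2       d2               1
 3     d2               1       d2               1       d4               2
 4     d1               0       d1               0       d1               0

NDCG_GT = 1.00, NDCG_RF1 = 1.00, NDCG_RF2 = 0.9203

$$DCG_{GT} = 2 + \left(\frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4}\right) = 4.6309$$

$$DCG_{RF1} = 2 + \left(\frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4}\right) = 4.6309$$

$$DCG_{RF2} = 2 + \left(\frac{1}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4}\right) = 4.2619$$

$$MaxDCG = DCG_{GT} = 4.6309$$

NDCG (at 4) - Example

Graded ranking/ordering (gains in ranked order): 4, 2, 0, 1

$$DCG = 4 + \frac{2}{\log_2 2} + \frac{0}{\log_2 3} + \frac{1}{\log_2 4} = 6.5$$

$$IDCG = 4 + \frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} = 6.63$$

$$NDCG = \frac{DCG}{IDCG} = \frac{6.5}{6.63} = 0.98$$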
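The NDCG values in both examples above can be checked with a short Python sketch. It assumes the base-2 DCG with an undiscounted first rank, and it assumes the ideal ranking is simply the same gains sorted in descending order, which holds here because every judged document appears in each ranking. The function names are illustrative.

```python
import math


def dcg(gains):
    """DCG with base-2 log discount; rank 1 is not discounted."""
    return sum(gain if rank == 1 else gain / math.log2(rank)
               for rank, gain in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending-gain) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Gains r_i in ranked order for the two ranking functions.
rf1 = [2, 2, 1, 0]   # d3, d4, d2, d1
rf2 = [2, 1, 2, 0]   # d3, d2, d4, d1

print(round(dcg(rf2), 4), round(ndcg(rf1), 4), round(ndcg(rf2), 4))

# Second example: gains 4, 2, 0, 1 in ranked order.
print(round(ndcg([4, 2, 0, 1]), 2))
```

Ranking Function1 swaps two documents with equal gain, so it scores exactly the same as the ground truth; Ranking Function2 moves a gain-2 document below a gain-1 document and is penalized.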

R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents.

 n    doc #    relevant
 1    588      x
 2    589      x
 3    576
 4    590      x
 5    986
 6    592      x
 7    984
 8    988
 9    578
10    985
11    103
12    591
13    772      x
14    990

R = # of relevant docs = 6

R-Precision = 4/6 = 0.67

Test Collections

Creating Test Collections for IR Evaluation

What we need for a benchmark

A collection of documents

Documents must be representative of the documents we expect to see in reality.

A collection of information needs

. . . which we will often incorrectly refer to as queries

Information needs must be representative of the information needs we expect to see in reality.

Human relevance assessments

We need to hire/pay “judges” or assessors to do this.

Expensive, time-consuming

Judges must be representative of the users we expect to see in reality.


Standard relevance benchmark: Cranfield

Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness

Late 1950s, UK

1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document pairs

Too small, too untypical for serious IR evaluation today


Standard relevance benchmark: TREC

TREC = Text Retrieval Conference

Organized by the U.S. National Institute of Standards and Technology (NIST)

TREC is actually a set of several different relevance benchmarks.

Best known: TREC Ad Hoc, used for first 8 TREC evaluations between 1992 and 1999

1.89 million documents, mainly newswire articles, 450 information needs

No exhaustive relevance judgments: too expensive

Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.


Standard relevance benchmarks: Others

GOV2

Another TREC/NIST collection

25 million web pages

Used to be largest collection that is easily available

But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index

NTCIR

East Asian language and cross-language information retrieval

Cross Language Evaluation Forum (CLEF)

This evaluation series has concentrated on European languages and cross-language information retrieval.

Many others



Validity of relevance assessments

Relevance assessments are only usable if they are consistent.

If they are not consistent, then there is no “truth” and experiments are not repeatable.

How can we measure this consistency or agreement among judges?

→ Kappa measure


Kappa measure for inter-judge (dis)agreement

Kappa measure

Agreement measure among judges

Designed for categorical judgments

Corrects for chance agreement

Kappa = (P(A) - P(E)) / (1 - P(E))

P(A): proportion of time judges agree

P(E): what agreement would be by chance

Kappa = 0 for chance agreement, 1 for total agreement.


Kappa Measure: Example

Number of docs Judge 1 Judge 2

300 Relevant Relevant

70 Nonrelevant Nonrelevant

20 Relevant Nonrelevant

10 Nonrelevant Relevant

P(A)? P(E)?

Kappa Example

P(A) = 370/400 = 0.925

P(nonrelevant) = (10+20+70+70)/800 = 0.2125

P(relevant) = (10+20+300+300)/800 = 0.7875

P(E) = 0.2125^2 + 0.7875^2 = 0.665

Kappa = (0.925 - 0.665)/(1 - 0.665) = 0.776

Kappa > 0.8: good agreement

0.67 < Kappa < 0.8: “tentative conclusions” (Carletta ’96)

Depends on purpose of study

For >2 judges: average pairwise kappas
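The kappa calculation for two judges can be sketched in Python as below. The function name and argument names are hypothetical; chance agreement is estimated from the pooled marginals of both judges, as in the worked example.

```python
def kappa_two_judges(both_rel, both_nonrel, only_j1_rel, only_j2_rel):
    """Kappa for two judges making binary relevant/nonrelevant judgments."""
    n = both_rel + both_nonrel + only_j1_rel + only_j2_rel
    p_agree = (both_rel + both_nonrel) / n
    # Pooled marginal: share of all 2n judgments that said "relevant".
    p_rel = (2 * both_rel + only_j1_rel + only_j2_rel) / (2 * n)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)


# Counts from the example table: 300 / 70 / 20 / 10 documents.
kappa = kappa_two_judges(both_rel=300, both_nonrel=70,
                         only_j1_rel=20, only_j2_rel=10)
print(round(kappa, 3))
```

On the example counts this lands in the "tentative conclusions" band, just below the 0.8 threshold for good agreement.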

Kappa Example : Alternative view

Both judges score non-relevant randomly: 80/400 * 90/400 = 0.045

Both judges score relevant randomly: 320/400 * 310/400 = 0.62

Both judges agree = 0.045 + 0.62 = 0.665

Both judges disagree: 320/400 * 90/400 + 310/400 * 80/400 = 0.18 + 0.155 = 0.335

P(E) = 0.665 / (0.665 + 0.335) = 0.665

Evaluation at large search engines

Search engines have test collections of queries and hand-ranked results

Recall is difficult to measure on the web

Search engines often use precision at top k, e.g., k = 10

. . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.

NDCG (Normalized Discounted Cumulative Gain)

Search engines also use non-relevance-based measures.

Clickthrough on first result

Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.

Studies of user behavior in the lab

A/B testing

A/B testing

Purpose: Test a single innovation

Prerequisite: You have a large search engine up and running.

Have most users use old system

Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation

Evaluate with an “automatic” measure like clickthrough on first result

Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most


Other Evaluation Measures

Adapted from Slides Attributed to

Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)


Fallout Rate

Problems with both precision and recall:

Number of irrelevant documents in the collection is not taken into account.

Recall is undefined when there is no relevant document in the collection.

Precision is undefined when no document is retrieved.

$$\text{Fallout} = \frac{\text{no. of nonrelevant items retrieved}}{\text{total no. of nonrelevant items in the collection}}$$


Subjective Relevance Measure

Novelty Ratio: The proportion of items retrieved and judged relevant by the user of which they were previously unaware.

Ability to find new information on a topic.

Coverage Ratio: The proportion of relevant items retrieved out of the total relevant documents known to a user prior to the search.

Relevant when the user wants to locate documents which they have seen before (e.g., the budget report for Year 2000).



Other Factors to Consider

User effort: Work required from the user in formulating queries, conducting the search, and screening the output.

Response time: Time interval between receipt of a user query and the presentation of system responses.

Form of presentation: Influence of search output format on the user’s ability to utilize the retrieved materials.

Collection coverage: Extent to which any/all relevant items are included in the document corpus.


Early Test Collections

Previous experiments were based on the SMART collection, which is fairly small. (ftp://ftp.cs.cornell.edu/pub/smart)

Collection Name    Number Of Documents    Number Of Queries    Raw Size (Mbytes)
CACM               3,204                  64                   1.5
CISI               1,460                  112                  1.3
CRAN               1,400                  225                  1.6
MED                1,033                  30                   1.1
TIME               425                    83                   1.5

Different researchers used different test collections and evaluation techniques.


Critique of pure relevance

Relevance vs Marginal Relevance

A document can be redundant even if it is highly relevant

Duplicates

The same information from different sources

Marginal relevance is a better measure of utility for the user.

Using facts/entities as evaluation units more directly measures true relevance.

But harder to create evaluation set

