Evaluation of IR Systems
Adapted from Lectures by
Prabhakar Raghavan (Google) and
Christopher Manning (Stanford)
Prasad 1L10Evaluation
This lecture
Results summaries: making our good results presentable and usable to a user
How do we know if our results are any good?
Evaluating a search engine
Benchmarks
Precision and recall
Result Summaries
Having ranked the documents matching a query, we wish to present a results list.
Most commonly, a list of document titles plus a short summary, aka “10 blue links”.
Summaries
The title is typically automatically extracted from document metadata. What about the summaries?
This description is crucial: the user can identify good/relevant hits based on the description.
Two basic kinds:
A static summary of a document is always the same, regardless of the query that hit the doc.
A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query.
Static summaries
In typical systems, the static summary is a subset of the document.
Simplest heuristic: the first 50 (or so; this can be varied) words of the document.
Summary cached at indexing time.
More sophisticated: extract from each document a set of “key” sentences.
Simple NLP heuristics to score each sentence.
Summary is made up of top-scoring sentences.
Most sophisticated: NLP used to synthesize a summary.
Seldom used in IR (cf. text summarization work).
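The simplest heuristic above can be sketched in a few lines of Python. This is only an illustration: the function name `static_summary` and the whitespace tokenization are assumptions, not part of the lecture.

```python
def static_summary(text, max_words=50):
    """Return the first `max_words` whitespace-separated words of `text`.

    Hypothetical helper: a real system would compute this once at indexing
    time and cache it alongside the document.
    """
    words = text.split()
    return " ".join(words[:max_words])


# Illustrative 120-word document: "w1 w2 ... w120".
doc = " ".join(f"w{i}" for i in range(1, 121))
summary = static_summary(doc)
print(len(summary.split()))  # number of words kept in the summary
```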
Dynamic summaries
Present one or more “windows” within the document that contain several of the query terms.
“KWIC” snippets: Keyword in Context presentation.
Generated in conjunction with scoring:
If the query is found as a phrase, all or some occurrences of the phrase in the doc.
If not, doc windows that contain multiple query terms.
The summary itself gives the entire content of the window: all terms, not only the query terms.
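A minimal sketch of the window idea, assuming whitespace tokenization and a fixed window size (both are simplifications, and `kwic_snippet` is a hypothetical name): slide a window over the document and keep the first window that covers the most distinct query terms.

```python
def kwic_snippet(text, query_terms, window=6):
    """Return the first `window`-token span covering the most distinct query terms."""
    tokens = text.split()
    terms = {t.lower() for t in query_terms}

    def normalize(token):
        # Lowercase and strip surrounding punctuation before matching.
        return token.lower().strip(".,;:!?\"'()")

    best_start, best_hits = 0, -1
    for start in range(max(1, len(tokens) - window + 1)):
        span = tokens[start:start + window]
        hits = len({normalize(t) for t in span} & terms)
        if hits > best_hits:  # strict ">" keeps the earliest best window
            best_start, best_hits = start, hits
    # The snippet shows every token in the window, not only the query terms.
    return " ".join(tokens[best_start:best_start + window])


# Illustrative text and query (made up for this example).
text = ("the quick brown fox jumps over the lazy dog while "
        "the red wine ages in the white oak barrel")
snippet = kwic_snippet(text, ["red", "wine", "white"], window=6)
print(snippet)
```

A production system would also try to respect sentence boundaries and the display budget, which this sketch ignores.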
Generating dynamic summaries
If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
If we cache the documents at index time, then we can find windows in them, cueing from hits found in the positional index.
E.g., the positional index says “the query is a phrase in position 4378”, so we go to this position in the cached document and stream out the content.
Most often, cache only a fixed-size prefix of the doc.
Note: the cached copy can be outdated.
Dynamic summaries
Producing good dynamic summaries is a tricky optimization problem.
The real estate for the summary is normally small and fixed.
Want snippets to be long enough to be useful.
Want linguistically well-formed snippets.
Want snippets maximally informative about the doc.
But users really like snippets, even if they complicate IR system design.
Alternative results presentations?
An active area of HCI research.
An alternative: http://www.searchme.com/ copies the idea of Apple’s Cover Flow for search results.
Evaluating search engines
Measures for a search engine
How fast does it index? E.g., number of bytes per hour.
How fast does it search? E.g., latency as a function of queries per second.
What is the cost per query? In dollars.
All of the preceding criteria are measurable: we can quantify speed / size / money.
However, the key measure for a search engine is user happiness.
Data Retrieval vs Information Retrieval
DR performance evaluation (after establishing correctness): response time, index space, …
IR performance evaluation: How relevant is the answer set? How happy are the users? (Required to establish “functional correctness”, e.g., through benchmarks.)
Measures for a search engine
What is user happiness? Factors include:
Speed of response
Size of index
Uncluttered UI
Most important: relevance
(Actually, maybe even more important: it’s free)
None of these is sufficient: blindingly fast, but useless answers won’t make a user happy.
How can we quantify user happiness?
Measuring user happiness: Who is the user?
Web search engine: searcher. Success: Searcher finds what was looked for. Measure: rate of return to this search engine
Web search engine: advertiser. Success: Searcher clicks on ad. Measure: clickthrough rate
Ecommerce: buyer. Success: Buyer buys something. Measures: time to purchase, fraction of “conversions” of searchers to buyers
Ecommerce: seller. Success: Seller sells something. Measure: profit per item sold
Enterprise: CEO. Success: Employees are more productive (because of effective search). Measure: profit of the company
Happiness: elusive to measure
Most common proxy: relevance of search results
Standard methodology in IR: relevance measurement requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment of relevance for each query-document pair
Some benchmarks use binary relevance judgments; others use multi-valued relevance (or partial orders).
Evaluating an IR system
Note: the information need is translated into a query.
Relevance is assessed relative to the information need, not the query.
E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has these words.
Evaluating an IR system
Information need i : “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”
Query q: [red wine white wine heart attack]
Consider document d′: At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.
d′ is an excellent match for query q … but d′ is not relevant to the information need i.
Difficulties with gauging Relevancy
Relevancy, from a human standpoint, is:
Subjective: depends upon a specific user’s judgment.
Situational: relates to the user’s current needs.
Cognitive: depends on human perception and behavior.
Dynamic: changes over time.
Standard relevance benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters and other benchmark doc collections are used.
“Retrieval tasks” are specified, sometimes as queries.
Human experts mark, for each query and for each doc, Relevant or Nonrelevant, or at least do so for the subset of docs that some system returned for that query.
Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

| | Relevant | Nonrelevant |
|---|---|---|
| Retrieved | tp (true positive) | fp (false positive) |
| Not retrieved | fn (false negative) | tn (true negative) |

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
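The two definitions can be sketched directly from the contingency counts. The document IDs below are made up for illustration, and returning 0.0 for an empty retrieved or relevant set is a convention assumed here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for an unranked result set.

    `retrieved` and `relevant` are sets of document IDs.
    """
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved docs that are not relevant
    fn = len(relevant - retrieved)   # relevant docs that were missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 6 docs relevant, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8})
print(p, r)
```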
Precision and Recall
Precision and Recall in Practice
Precision: the ability to retrieve top-ranked documents that are mostly relevant. The fraction of the retrieved documents that are relevant.
Recall: the ability of the search to find all of the relevant items in the corpus. The fraction of the relevant documents that are retrieved.
Introduction to Information Retrieval
Accuracy
Why do we use complex measures like precision, recall, etc.? Why not something simple like accuracy?
Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.
In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn).
Why is accuracy not a useful measure for web information retrieval?
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget…
Search for:
0 matching results found.
People doing information retrieval want to find something and have a certain tolerance for junk.
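To see why the empty-result engine scores so well, here is a small sketch with assumed numbers: a collection of 1,000,000 documents of which exactly one is relevant to the query. The counts are hypothetical, chosen only so that the arithmetic matches the 99.9999% figure.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# The engine retrieves nothing, so tp = fp = 0.
# Assumed collection: 1 relevant doc (missed) and 999,999 nonrelevant docs.
tp, fp, fn, tn = 0, 0, 1, 999_999

acc = accuracy(tp, fp, fn, tn)
rec = tp / (tp + fn)  # recall is zero: the one relevant doc was missed
print(f"accuracy = {acc:.4%}, recall = {rec:.0%}")
```

Because almost every document is nonrelevant to any given query, the tn count dominates, which is why accuracy says little about retrieval quality.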
Precision/Recall
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved.
In a good system, precision decreases as either the number of docs retrieved or recall increases.
This is not a theorem, but a result with strong empirical confirmation.
Trade-offs
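[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). The ideal is high precision together with high recall. A high-precision, low-recall system returns relevant documents but misses many useful ones too. A high-recall, low-precision system returns most relevant documents but includes a lot of junk.]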
Difficulties in using precision/recall
Should average over large document collection/query ensembles.
Need human relevance assessments. People aren’t reliable assessors.
Assessments have to be binary. Nuanced assessments?
Heavily skewed by collection/authorship. Results may not translate from one domain to another.
A combined measure: F
Combined measure that assesses the precision/recall tradeoff is the F measure (harmonic mean):
$$F = \frac{2PR}{P + R} = \frac{2}{\frac{1}{R} + \frac{1}{P}}$$
Harmonic mean is a conservative average.
See C. J. van Rijsbergen, Information Retrieval.
Aka E Measure (parameterized F Measure)
Variants of the F measure that allow weighting emphasis on precision over recall:
$$E = \frac{(1 + \beta^2) P R}{\beta^2 P + R} = \frac{1 + \beta^2}{\frac{\beta^2}{R} + \frac{1}{P}}$$
Value of β controls the trade-off:
β = 1: equally weight precision and recall (E = F).
β > 1: weight recall more.
β < 1: weight precision more.
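The effect of the weighting parameter is easy to check numerically. The sketch below assumes the parameterized form (1 + β²)PR / (β²P + R); the function name `f_beta` and the sample values P = 0.5, R = 0.25 are illustrative choices.

```python
def f_beta(precision, recall, beta=1.0):
    """Parameterized F measure: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # convention assumed here when P = R = 0
    return (1 + beta ** 2) * precision * recall / denom


p, r = 0.5, 0.25
balanced = f_beta(p, r, beta=1.0)          # harmonic mean of P and R
recall_heavy = f_beta(p, r, beta=2.0)      # pulled toward recall (0.25)
precision_heavy = f_beta(p, r, beta=0.5)   # pulled toward precision (0.5)
print(balanced, recall_heavy, precision_heavy)
```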
F1 and other averages
[Figure: Combined Measures. Minimum, maximum, arithmetic mean, geometric mean, and harmonic mean of precision and recall (y-axis, 0 to 100), plotted against precision (x-axis, 0 to 100) with recall fixed at 70%.]
F: Example

| | relevant | not relevant | total |
|---|---|---|---|
| retrieved | 20 | 40 | 60 |
| not retrieved | 60 | 1,000,000 | 1,000,060 |
| total | 80 | 1,000,040 | 1,000,120 |

P = 20/(20 + 40) = 1/3
R = 20/(20 + 60) = 1/4
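As a worked check of this example, plugging P = 1/3 and R = 1/4 into the balanced F measure (the harmonic mean of precision and recall) gives:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \cdot \frac{1}{3} \cdot \frac{1}{4}}{\frac{1}{3} + \frac{1}{4}} = \frac{1/6}{7/12} = \frac{2}{7} \approx 0.286$$

For comparison, accuracy on the same table would be (20 + 1,000,000)/1,000,120 ≈ 0.9999, even though most relevant documents were missed.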
Exercise
Compute precision, recall and F1 for this result set:

| | relevant | not relevant |
|---|---|---|
| retrieved | 18 | 2 |
| not retrieved | 82 | 1,000,000,000 |
Recall vs Precision and F1
[Figure: precision and F1 (y-axis) plotted against recall (x-axis), both on a 0 to 1 scale.]
Breakeven Point
Breakeven point is the point where precision equals recall.
Alternative single measure of IR effectiveness.
How do you compute it?
Evaluating ranked results
Precision/recall/F are measures for unranked sets.
We can easily turn set measures into measures of ranked lists.
Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4 etc. results
Doing this for precision and recall gives you a precision-recall curve.
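Computing Recall/Precision Points: An Example
Let total # of relevant docs = 6. Check each new recall point:

| n | doc # | relevant | recall / precision |
|---|---|---|---|
| 1 | 588 | x | R=1/6=0.167; P=1/1=1 |
| 2 | 589 | x | R=2/6=0.333; P=2/2=1 |
| 3 | 576 | | |
| 4 | 590 | x | R=3/6=0.5; P=3/4=0.75 |
| 5 | 986 | | |
| 6 | 592 | x | R=4/6=0.667; P=4/6=0.667 |
| 7 | 984 | | |
| 8 | 988 | | |
| 9 | 578 | | |
| 10 | 985 | | |
| 11 | 103 | | |
| 12 | 591 | | |
| 13 | 772 | x | R=5/6=0.833; P=5/13=0.38 |
| 14 | 990 | | |

Missing one relevant document. Never reach 100% recall.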
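The recall/precision points in this example can be reproduced with a short sketch. The list `ranking` encodes the relevance marks of the 14 ranked documents; the function name is a hypothetical choice.

```python
def recall_precision_points(is_relevant, total_relevant):
    """Return (rank, recall, precision) at each rank where a relevant doc appears."""
    points = []
    hits = 0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


# Relevance marks for ranks 1..14 from the example (True = marked "x").
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]

points = recall_precision_points(ranking, total_relevant=6)
for rank, rec, prec in points:
    print(f"n={rank:2d}  R={rec:.3f}  P={prec:.3f}")
```

Only five points are produced because the sixth relevant document never appears in the ranking, so recall tops out at 5/6.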
A precision-recall curve
[Figure: precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0).]
Averaging over queries
A precision-recall graph for one query isn’t a very sensible thing to look at.
You need to average performance over a whole bunch of queries.
But there’s a technical issue:
Precision-recall calculations place some points on the graph.
How do you determine a value (interpolate) between the points?
Sec. 8.4
Interpolated precision
Idea: If locally precision increases with increasing recall, then you should get to count that…
So you take the max of the precisions to the right of the value.
Interpolating a Recall/Precision Curve
Interpolate a precision value for each standard recall level:
$$r_j \in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$$
$$r_0 = 0.0,\ r_1 = 0.1,\ \dots,\ r_{10} = 1.0$$
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level at or above the j-th level:
$$P(r_j) = \max_{r \ge r_j} P(r)$$
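A minimal sketch of this interpolation rule follows. The measured points are illustrative: five (recall, precision) pairs for a query with six relevant documents, one of which is never retrieved. Assigning 0.0 at recall levels that no measured point reaches is an assumption of this sketch.

```python
def interpolated_precision(points, levels):
    """Interpolated precision at each level: max precision at recall >= level."""
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        # Assumption: precision is 0.0 where no measured recall reaches the level.
        result.append(max(candidates) if candidates else 0.0)
    return result


# Illustrative measured (recall, precision) points.
points = [(1 / 6, 1.0), (2 / 6, 1.0), (3 / 6, 0.75), (4 / 6, 4 / 6), (5 / 6, 5 / 13)]
levels = [j / 10 for j in range(11)]  # r_0 = 0.0, ..., r_10 = 1.0

interp = interpolated_precision(points, levels)
eleven_point_average = sum(interp) / len(interp)
for level, value in zip(levels, interp):
    print(f"recall {level:.1f}: interpolated precision {value:.3f}")
```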
Interpolated precision-recall curve
[Figure: interpolated precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0).]
Evaluation Metrics (cont’d)
Graphs are good, but people want summary measures!
Precision at fixed retrieval level
Precision-at-k: precision of the top k results.
Perhaps appropriate for web search: all people want are good matches on the first one or two results pages.
But: averages badly and has an arbitrary parameter k.
11-point interpolated average precision
The standard measure in the early TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths, using interpolation (the value for 0 is always interpolated!), and average them.
Evaluates performance at all recall levels.
Typical (good) 11 point precisions
SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: 11-point interpolated precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1).]
11-point interpolated average precision

| Recall | Interpolated Precision |
|---|---|
| 0.0 | 1.00 |
| 0.1 | 0.67 |
| 0.2 | 0.63 |
| 0.3 | 0.55 |
| 0.4 | 0.45 |
| 0.5 | 0.41 |
| 0.6 | 0.36 |
| 0.7 | 0.29 |
| 0.8 | 0.13 |
| 0.9 | 0.10 |
| 1.0 | 0.08 |

11-point average: ≈ 0.425
How can precision at 0.0 be > 0?
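The 11-point average is just the arithmetic mean of the eleven interpolated values, which a short check confirms (the variable names are illustrative):

```python
# Interpolated precision at recall 0.0, 0.1, ..., 1.0, copied from the table.
interpolated = [1.00, 0.67, 0.63, 0.55, 0.45, 0.41, 0.36, 0.29, 0.13, 0.10, 0.08]

eleven_point_average = sum(interpolated) / len(interpolated)
print(round(eleven_point_average, 3))
```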
11 point precisions
[Figure: precision (y-axis) plotted against recall (x-axis), both in percent.]
Receiver Operating Characteristics (ROC) Curve
True positive rate = tp/(tp+fn) = recall = sensitivity
False positive rate = fp/(tn+fp). Related to precision: fpr = 0 <-> p = 1
Why is the blue line “worthless”?
Variance of measures like precision/recall
For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1).
Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query.
That is, there are easy information needs and hard ones.
Mean average precision (MAP)
Average precision (AP) for a query
Average of the precision values obtained at each of the (top k) relevant documents retrieved.
This approach weights early appearance of a relevant document over later appearance.
MAP for a query collection is the mean (arithmetic average) of AP over the queries.
Macro-averaging: each query counts equally.
Average Precision
Mean Average Precision (MAP)
summarize rankings from multiple queries by taking mean (averaging) of average precision
most commonly used measure in research papers
assumes user is interested in finding many relevant documents for each query
requires many binary relevance judgments in text collection
MAP
Summarize a Ranking: MAP
Given that n docs are retrieved:
Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs.
E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.
If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero.
Compute the average over all the relevant documents:
Average precision = (p(1)+…+p(k))/k
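The procedure above can be sketched as follows. The ranking is an illustrative one: 14 retrieved documents, six relevant documents in total, one of which is never retrieved and therefore contributes zero. The function name `average_precision` is a hypothetical choice.

```python
def average_precision(is_relevant, total_relevant):
    """Non-interpolated average precision for one ranked list.

    Sums the precision at each rank where a relevant doc appears and divides
    by the total number of relevant docs, so unretrieved relevant docs count as 0.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


# Illustrative ranking: relevant docs at ranks 1, 2, 4, 6 and 13.
ranking = [True, True, False, True, False, True, False,
           False, False, False, False, False, True, False]
ap = average_precision(ranking, total_relevant=6)
print(round(ap, 4))
```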
(cont’d)
This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document.
Mean Average Precision (MAP)
MAP = arithmetic mean of average precision over a set of topics
gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
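The difference between the two means shows up as soon as one topic is hard. In this sketch the per-topic AP values are made up, and the geometric mean assumes every AP is strictly positive (implementations commonly add a small constant to avoid log(0); that detail is omitted here).

```python
import math


def mean_average_precision(ap_values):
    """Arithmetic mean of per-topic average precision."""
    return sum(ap_values) / len(ap_values)


def geometric_map(ap_values):
    """Geometric mean of per-topic average precision (all values must be > 0)."""
    return math.exp(sum(math.log(ap) for ap in ap_values) / len(ap_values))


# Hypothetical AP values: two easy topics and one difficult topic.
ap_values = [0.8, 0.6, 0.05]
map_score = mean_average_precision(ap_values)
gmap_score = geometric_map(ap_values)
print(round(map_score, 3), round(gmap_score, 3))
```

The single low value pulls gMAP down much further than MAP, which is the sense in which gMAP is more affected by difficult topics.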
Discounted Cumulative Gain
Popular measure for evaluating web search and related tasks.
Two assumptions:
Highly relevant documents are more useful than marginally relevant documents.
The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined.
Discounted Cumulative Gain
Uses graded relevance as a measure of usefulness, or gain, from examining a document
Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
Typical discount is 1/log(rank).
With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3.
Summarize a Ranking: DCG
What if relevance judgments are on a scale of [1, r], r > 2?
Cumulative Gain (CG) at rank n
Let the ratings of the n documents be r_1, r_2, …, r_n (in ranked order).
$$CG = r_1 + r_2 + \dots + r_n$$
Discounted Cumulative Gain (DCG) at rank n
$$DCG = r_1 + \frac{r_2}{\log_2 2} + \frac{r_3}{\log_2 3} + \dots + \frac{r_n}{\log_2 n}$$
We may use any base for the logarithm, e.g., base = b.
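Discounted Cumulative Gain
DCG is the total gain accumulated at a particular rank p:
$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$
Alternative formulation:
$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(1 + i)}$$
Used by some web search companies.
Emphasis on retrieving highly relevant documents.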
DCG Example
10 ranked documents judged on a 0-3 relevance scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
Discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
DCG:
3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
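The running DCG values in this example can be reproduced with a few lines, assuming the form used here: no discount at rank 1 and a 1/log2(rank) discount from rank 2 on. The function name `running_dcg` is a hypothetical choice.

```python
import math


def running_dcg(ratings):
    """Return the DCG accumulated at each rank 1..n (base-2 log discount)."""
    total = 0.0
    values = []
    for rank, rating in enumerate(ratings, start=1):
        # Rank 1 is not discounted; later ranks are divided by log2(rank).
        total += rating if rank == 1 else rating / math.log2(rank)
        values.append(total)
    return values


ratings = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
dcg_values = running_dcg(ratings)
print([round(v, 2) for v in dcg_values])
```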
![Page 58: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/58.jpg)
Summarize a Ranking: NDCG
Normalized Discounted Cumulative Gain (NDCG) at rank n
  Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
  The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc.
Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs
NDCG is now quite popular in evaluating Web search
![Page 59: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/59.jpg)
NDCG - Example
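4 documents: d1, d2, d3, d4

| i | Ground Truth: Document Order | r_i | Ranking Function 1: Document Order | r_i | Ranking Function 2: Document Order | r_i |
|---|---|---|---|---|---|---|
| 1 | d4 | 2 | d3 | 2 | d3 | 2 |
| 2 | d3 | 2 | d4 | 2 | d2 | 1 |
| 3 | d2 | 1 | d2 | 1 | d4 | 2 |
| 4 | d1 | 0 | d1 | 0 | d1 | 0 |

NDCG_GT = 1.00, NDCG_RF1 = 1.00, NDCG_RF2 = 0.9203

$DCG_{GT} = 2 + \frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} = 4.6309$
$DCG_{RF1} = 2 + \frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} = 4.6309$
$DCG_{RF2} = 2 + \frac{1}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} = 4.2619$
$MaxDCG = DCG_{GT} = 4.6309$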
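The NDCG values for the two ranking functions can be reproduced with the sketch below. It assumes the log base 2 discount with rank 1 undiscounted, and it takes the ideal ranking to be the same documents sorted by decreasing relevance (which holds here because every judged document appears in each ranking). The names `dcg`, `ndcg`, `rf1` and `rf2` are illustrative.

```python
import math


def dcg(gains):
    """DCG of a ranked list of relevance grades (rank 1 is not discounted)."""
    return sum(
        gain if rank == 1 else gain / math.log2(rank)
        for rank, gain in enumerate(gains, start=1)
    )


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering of the same grades."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Relevance grades from the ground truth in the example.
relevance = {"d1": 0, "d2": 1, "d3": 2, "d4": 2}
rf1 = ["d3", "d4", "d2", "d1"]  # Ranking Function 1
rf2 = ["d3", "d2", "d4", "d1"]  # Ranking Function 2

ndcg_rf1 = ndcg([relevance[d] for d in rf1])
ndcg_rf2 = ndcg([relevance[d] for d in rf2])
print(round(ndcg_rf1, 4), round(ndcg_rf2, 4))
```

Swapping d3 and d4 in Ranking Function 1 costs nothing because both have grade 2, whereas Ranking Function 2 is penalized for placing the grade-1 document above a grade-2 document.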
![Page 60: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/60.jpg)
NDCG (at 4) - Example
Graded ranking/ordering: 4, 2, 0, 1
DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) = 6.5
IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 6.63
NDCG = DCG/IDCG = 6.5/6.63 = 0.98
![Page 61: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/61.jpg)
R- Precision
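Precision at the R-th position in the ranking of results for a query that has R relevant documents.

| n | doc # | relevant |
|---|---|---|
| 1 | 588 | x |
| 2 | 589 | x |
| 3 | 576 | |
| 4 | 590 | x |
| 5 | 986 | |
| 6 | 592 | x |
| 7 | 984 | |
| 8 | 988 | |
| 9 | 578 | |
| 10 | 985 | |
| 11 | 103 | |
| 12 | 591 | |
| 13 | 772 | x |
| 14 | 990 | |

R = # of relevant docs = 6
R-Precision = 4/6 = 0.67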
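As a quick check of the R-Precision example, the sketch below encodes the ranked list as 0/1 relevance flags (1 where the slide marks an x) and counts the relevant documents in the top R positions. The function name `r_precision` and the flag encoding are illustrative choices.

```python
def r_precision(ranked_flags, num_relevant):
    """Precision at rank R, where R is the number of relevant documents for the query.

    ranked_flags: 1 if the document at that rank is relevant, else 0.
    """
    if num_relevant <= 0:
        raise ValueError("R-Precision is undefined when there are no relevant documents")
    return sum(ranked_flags[:num_relevant]) / num_relevant


# Ranks 1..14 from the example; relevant documents are at ranks 1, 2, 4, 6 and 13.
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
value = r_precision(flags, num_relevant=6)
print(round(value, 2))
```

Only the first six positions matter, so the relevant document at rank 13 does not contribute.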
![Page 62: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/62.jpg)
Test Collections
![Page 63: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/63.jpg)
Creating Test Collections for IR Evaluation
![Page 64: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/64.jpg)
Introduction to Information Retrieval
What we need for a benchmark
A collection of documents
  Documents must be representative of the documents we expect to see in reality.
A collection of information needs
  . . . which we will often incorrectly refer to as queries
  Information needs must be representative of the information needs we expect to see in reality.
Human relevance assessments
  We need to hire/pay “judges” or assessors to do this.
  Expensive, time-consuming
  Judges must be representative of the users we expect to see in reality.
![Page 65: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/65.jpg)
Standard relevance benchmark: Cranfield
Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness
Late 1950s, UK
1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document pairs
Too small, too untypical for serious IR evaluation today
![Page 66: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/66.jpg)
Standard relevance benchmark: TREC
TREC = Text Retrieval Conference
Organized by the U.S. National Institute of Standards and Technology (NIST)
TREC is actually a set of several different relevance benchmarks.
Best known: TREC Ad Hoc, used for first 8 TREC evaluations between 1992 and 1999
1.89 million documents, mainly newswire articles, 450 information needs
No exhaustive relevance judgments – too expensive
Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.
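The idea of judging only documents that some participating system returned in its top k can be sketched as building a pool per information need. This is an illustrative sketch, not NIST's actual procedure; the function name `build_pool`, the run names, and the document ids are all hypothetical.

```python
def build_pool(runs, k):
    """Union of the top-k documents over all submitted runs for one information need.

    runs: dict mapping a system name to its ranked list of document ids.
    Only documents in the returned pool would be shown to assessors.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


# Two hypothetical systems and their rankings for the same information need.
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d1"],
}
pool = build_pool(runs, k=2)
print(sorted(pool))
```

Documents outside the pool (here d3) receive no judgment, which is why the relevance judgments are not exhaustive.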
![Page 67: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/67.jpg)
Standard relevance benchmarks: Others
GOV2
  Another TREC/NIST collection
  25 million web pages
  Used to be largest collection that is easily available
  But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index
NTCIR
  East Asian language and cross-language information retrieval
Cross Language Evaluation Forum (CLEF)
  This evaluation series has concentrated on European languages and cross-language information retrieval.
Many others
![Page 68: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/68.jpg)
Validity of relevance assessments
Relevance assessments are only usable if they are consistent.
If they are not consistent, then there is no “truth” and experiments are not repeatable.
How can we measure this consistency or agreement among judges?
→ Kappa measure
![Page 69: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/69.jpg)
Kappa measure for inter-judge (dis)agreement
Kappa measure
  Agreement measure among judges
  Designed for categorical judgments
  Corrects for chance agreement
$\text{Kappa} = \frac{P(A) - P(E)}{1 - P(E)}$
P(A) – proportion of time judges agree
P(E) – what agreement would be by chance
Kappa = 0 for chance agreement, 1 for total agreement.
![Page 70: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/70.jpg)
Kappa Measure: Example
| Number of docs | Judge 1 | Judge 2 |
|---|---|---|
| 300 | Relevant | Relevant |
| 70 | Nonrelevant | Nonrelevant |
| 20 | Relevant | Nonrelevant |
| 10 | Nonrelevant | Relevant |
P(A)? P(E)?
![Page 71: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/71.jpg)
Kappa Example
P(A) = 370/400 = 0.925
P(nonrelevant) = (10+20+70+70)/800 = 0.2125
P(relevant) = (10+20+300+300)/800 = 0.7875
P(E) = 0.2125^2 + 0.7875^2 = 0.665
Kappa = (0.925 – 0.665)/(1 – 0.665) = 0.776
Kappa > 0.8: good agreement
0.67 < Kappa < 0.8: “tentative conclusions” (Carletta ’96)
Depends on purpose of study
For >2 judges: average pairwise kappas
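The kappa calculation above can be reproduced from the four agreement counts. This is a minimal sketch that follows the slide's use of pooled marginals (both judges' decisions combined when estimating P(relevant) and P(nonrelevant)); the function and parameter names are illustrative.

```python
def kappa_pooled(both_rel, both_nonrel, only_judge1_rel, only_judge2_rel):
    """Kappa for two judges and two categories, using pooled marginals for chance agreement."""
    n = both_rel + both_nonrel + only_judge1_rel + only_judge2_rel
    p_agree = (both_rel + both_nonrel) / n

    # Each judge makes n decisions, so there are 2n decisions in the pool.
    p_rel = (2 * both_rel + only_judge1_rel + only_judge2_rel) / (2 * n)
    p_nonrel = 1.0 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2

    return (p_agree - p_chance) / (1.0 - p_chance)


# Counts from the example: 300 / 70 agreements, 20 / 10 disagreements.
kappa = kappa_pooled(both_rel=300, both_nonrel=70, only_judge1_rel=20, only_judge2_rel=10)
print(round(kappa, 3))
```

The result falls in the "tentative conclusions" band, just under the 0.8 threshold for good agreement.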
![Page 72: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/72.jpg)
Kappa Example : Alternative view
Both judges score non-relevant randomly: 80/400 * 90/400 = 0.045
Both judges score relevant randomly: 320/400 * 310/400 = 0.62
Both judges agree = 0.045 + 0.62 = 0.665
Both judges disagree: 320/400 * 90/400 + 310/400 * 80/400 = 0.18 + 0.155 = 0.335
P(E) = 0.665 / (0.665 + 0.335) = 0.665
![Page 73: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/73.jpg)
Evaluation at large search engines
Search engines have test collections of queries and hand-ranked results
Recall is difficult to measure on the web
Search engines often use precision at top k, e.g., k = 10
. . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.
  NDCG (Normalized Discounted Cumulative Gain)
Search engines also use non-relevance-based measures.
  Clickthrough on first result
    Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.
  Studies of user behavior in the lab
  A/B testing
![Page 74: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/74.jpg)
A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large search engine up and running.
Have most users use old system
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
Evaluate with an “automatic” measure like clickthrough on first result
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most
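The traffic split and the clickthrough comparison can be sketched as follows. This is only an illustration under stated assumptions: users are assigned to an arm by hashing a user id (a common but here hypothetical scheme), and the log format of `(user_id, clicked_first_result)` pairs is invented for the example. A real deployment would also need a significance test before drawing conclusions.

```python
import hashlib


def assign_arm(user_id, treatment_fraction=0.01):
    """Deterministically map a user to 'new' or 'old' so repeat visits see the same system."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "new" if bucket < treatment_fraction else "old"


def first_result_ctr(clicks):
    """Fraction of sessions in which the first result was clicked."""
    return sum(clicks) / len(clicks) if clicks else 0.0


def compare_arms(log, treatment_fraction=0.01):
    """log: iterable of (user_id, clicked_first_result) pairs. Returns CTR per arm."""
    by_arm = {"old": [], "new": []}
    for user_id, clicked in log:
        by_arm[assign_arm(user_id, treatment_fraction)].append(clicked)
    return {arm: first_result_ctr(clicks) for arm, clicks in by_arm.items()}


# Hypothetical log entries; a 50% split is used here only so both arms get data.
log = [("user-%d" % i, i % 3 == 0) for i in range(1000)]
print(compare_arms(log, treatment_fraction=0.5))
```

Hashing the user id rather than sampling per request keeps each user's experience consistent across queries, which matters when the measure is aggregated over sessions.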
![Page 75: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/75.jpg)
SKIP DETAILS
![Page 76: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/76.jpg)
Other Evaluation Measures
Adapted from Slides Attributed to
Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
![Page 77: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/77.jpg)
Fallout Rate
Problems with both precision and recall:
  Number of irrelevant documents in the collection is not taken into account.
  Recall is undefined when there is no relevant document in the collection.
  Precision is undefined when no document is retrieved.
$\text{Fallout} = \frac{\text{no. of nonrelevant items retrieved}}{\text{total no. of nonrelevant items in the collection}}$
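Fallout can be computed directly from the retrieved set, the relevant set, and the collection size. The sketch below assumes every relevant document belongs to the collection; the function name and the toy document ids are illustrative.

```python
def fallout(retrieved, relevant, collection_size):
    """Nonrelevant items retrieved divided by all nonrelevant items in the collection."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    total_nonrelevant = collection_size - len(relevant)
    if total_nonrelevant <= 0:
        raise ValueError("Fallout is undefined when the collection has no nonrelevant items")
    return len(retrieved - relevant) / total_nonrelevant


# Toy collection of 12 documents, 2 of them relevant; 3 documents retrieved.
value = fallout(retrieved={"d1", "d2", "d3"}, relevant={"d1", "d5"}, collection_size=12)
print(value)
```

Unlike precision and recall, this measure stays defined when nothing is retrieved or when no relevant document exists, as long as the collection contains at least one nonrelevant item.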
![Page 78: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/78.jpg)
Subjective Relevance Measure
Novelty Ratio: The proportion of items retrieved and judged relevant by the user of which they were previously unaware.
  Ability to find new information on a topic.
Coverage Ratio: The proportion of relevant items retrieved out of the total relevant documents known to a user prior to the search.
  Relevant when the user wants to locate documents which they have seen before (e.g., the budget report for Year 2000).
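Both ratios reduce to simple set arithmetic once the user's prior knowledge is recorded. The sketch below is one reading of the two definitions above; the function names and the example document ids are hypothetical.

```python
def novelty_ratio(retrieved_relevant, known_before):
    """Share of retrieved relevant items that the user did not know before the search."""
    retrieved_relevant = set(retrieved_relevant)
    if not retrieved_relevant:
        raise ValueError("Novelty ratio is undefined when no relevant item is retrieved")
    return len(retrieved_relevant - set(known_before)) / len(retrieved_relevant)


def coverage_ratio(retrieved, known_relevant):
    """Share of the relevant items already known to the user that the search retrieved."""
    known_relevant = set(known_relevant)
    if not known_relevant:
        raise ValueError("Coverage ratio is undefined when the user knows no relevant item")
    return len(set(retrieved) & known_relevant) / len(known_relevant)


# The user already knew a, b and e; the search returned a, b, c, d (all relevant) plus x.
known = {"a", "b", "e"}
novelty = novelty_ratio(retrieved_relevant={"a", "b", "c", "d"}, known_before=known)
coverage = coverage_ratio(retrieved={"a", "b", "c", "d", "x"}, known_relevant=known)
print(novelty, round(coverage, 2))
```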
![Page 79: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/79.jpg)
Other Factors to Consider
User effort: Work required from the user in formulating queries, conducting the search, and screening the output.
Response time: Time interval between receipt of a user query and the presentation of system responses.
Form of presentation: Influence of search output format on the user’s ability to utilize the retrieved materials.
Collection coverage: Extent to which any/all relevant items are included in the document corpus.
![Page 80: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/80.jpg)
Early Test Collections
Previous experiments were based on the SMART collection which is fairly small. (ftp://ftp.cs.cornell.edu/pub/smart)

| Collection Name | Number of Documents | Number of Queries | Raw Size (Mbytes) |
|---|---|---|---|
| CACM | 3,204 | 64 | 1.5 |
| CISI | 1,460 | 112 | 1.3 |
| CRAN | 1,400 | 225 | 1.6 |
| MED | 1,033 | 30 | 1.1 |
| TIME | 425 | 83 | 1.5 |

Different researchers used different test collections and evaluation techniques.
![Page 81: Evaluation of IR Systems Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford) Prasad1L10Evaluation.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649e6a5503460f94b68169/html5/thumbnails/81.jpg)
Critique of pure relevance
Relevance vs Marginal Relevance
  A document can be redundant even if it is highly relevant
    Duplicates
    The same information from different sources
  Marginal relevance is a better measure of utility for the user.
Using facts/entities as evaluation units more directly measures true relevance.
  But harder to create evaluation set