ISP 433/633 Week 6: IR Evaluation

Page 1: ISP 433/633 Week 6, IR Evaluation

Page 2: Why Evaluate?

• Determine if the system is desirable

• Make comparative assessments

Page 3: What to Evaluate?

• How much of the information need is satisfied.

• How much was learned about a topic.

• Incidental learning:
  – How much was learned about the collection.
  – How much was learned about other topics.

• How inviting the system is.

Page 4: Relevance

• In what ways can a document be relevant to a query?
  – Answer a precise question precisely.
  – Partially answer the question.
  – Suggest a source for more information.
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...

Page 5: Relevance

• How relevant is the document
  – for this user, for this information need
• Subjective, but
• Measurable to some extent
  – How often do people agree a document is relevant to a query?
• How well does it answer the question?
  – Complete answer? Partial?
  – Background information?
  – Hints for further exploration?

Page 6: Get me stuff I want

Two considerations:

• Accuracy: Did it give me only stuff I wanted from the collection, or did it give me junk too?

• Comprehensiveness: Did it give me all the stuff I wanted from the collection, or did it miss some?

Page 7: Relevant vs. Retrieved

[Figure: Venn diagram of the Relevant and Retrieved sets within the set of all docs.]

Page 8: After Query…

                    Relevant? Yes   Relevant? No
    Retrieved? Yes  Hits            False Alarms
    Retrieved? No   Misses          Correct Rejections

Page 9: Precision vs. Recall

• Accuracy: Did it give me only stuff I wanted from the collection, or did it give me junk too?
  – Hits vs. False Alarms
  – Standard measure: Precision = |Hits| / |Retrieved|
• Comprehensiveness: Did it give me all the stuff I wanted from the collection, or did it miss some?
  – Hits vs. Misses
  – Standard measure: Recall = |Hits| / |Relevant|

                    Relevant? Yes   Relevant? No
    Retrieved? Yes  Hits            False Alarms
    Retrieved? No   Misses          Correct Rejections
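As a minimal sketch (the function and document names are illustrative, not from the slides), the two measures in Python, computed from sets of document IDs:

    def precision_recall(retrieved, relevant):
        """Precision and recall from sets of document IDs."""
        hits = retrieved & relevant                 # retrieved AND relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical example: 3 hits among 5 retrieved docs, 10 relevant overall
    retrieved = {"d1", "d2", "d3", "d4", "d5"}
    relevant = {"d1", "d3", "d5", "d7", "d9", "d11", "d13", "d15", "d17", "d19"}
    print(precision_recall(retrieved, relevant))    # (0.6, 0.3)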

Page 10: Retrieved vs. Relevant

Very high precision, very low recall

Page 11: Retrieved vs. Relevant

Very low precision, very low recall (0 in fact)

Page 12: Retrieved vs. Relevant

High recall, but low precision

Page 13: Retrieved vs. Relevant

High precision, high recall

Page 14: Precision/Recall Curves

• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries

[Figure: Precision (y-axis) plotted against Recall (x-axis); measured points marked x trace the tradeoff.]

Page 15: After a Single Query

Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} : 10 relevant documents

Ranked output (* = relevant):

1. d123*
2. d84
3. d56*
4. d6
5. d8
6. d9*
7. d511
8. d129
9. d187
10. d25*
11. d38
12. d48
13. d250
14. d113
15. d3*

• The first-ranked doc is relevant, which is 10% of the total relevant. Therefore Precision at the 10% Recall level is 100%.
• The next relevant doc (d56, at rank 3) gives us 66% Precision at the 20% Recall level.
• ...
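A short Python sketch (again with illustrative names) that walks this ranked list and reports precision each time another relevant document, i.e. another 10% of recall, is found:

    def precision_at_recall_points(ranking, relevant):
        """(recall, precision) at each rank where a relevant doc appears."""
        points, hits = [], 0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                points.append((hits / len(relevant), hits / rank))
        return points

    Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
               "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
    for r, p in precision_at_recall_points(ranking, Rq):
        print(f"recall {r:.0%}  precision {p:.1%}")
    # recall 10%  precision 100.0%
    # recall 20%  precision 66.7%
    # recall 30%  precision 50.0%
    # recall 40%  precision 40.0%
    # recall 50%  precision 33.3%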

Page 16: Graph for a Single Query

[Figure: Precision (%) vs. Recall (%) graph for this query, both axes from 0 to 100.]

Page 17: Averaging Multiple Queries

$$\bar{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}$$

where
  $\bar{P}(r)$ is the average Precision at Recall level $r$,
  $N_q$ is the number of queries, and
  $P_i(r)$ is the Precision at Recall level $r$ for the $i$-th query.
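As a sketch, the same average in Python, assuming each query's precision values are stored in a dict keyed by recall level:

    def average_precision_at_recall(per_query, r):
        """Mean precision at recall level r over all queries."""
        return sum(P_i[r] for P_i in per_query) / len(per_query)

    # Two hypothetical queries' precision-at-recall tables
    per_query = [{0.1: 1.0, 0.2: 2/3}, {0.1: 0.5, 0.2: 0.5}]
    print(average_precision_at_recall(per_query, 0.1))   # 0.75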

Page 18: 11 Standard Recall Levels

• 0%, 10%, 20%, …, 100%

• May need interpolation

Page 19: Interpolation

Rq = {d3, d56, d129} : 3 relevant documents

Ranked output (* = relevant):

1. d123
2. d84
3. d56*
4. d6
5. d8
6. d9
7. d511
8. d129*
9. d187
10. d25
11. d38
12. d48
13. d250
14. d113
15. d3*

• The first relevant doc is d56 at rank 3, which gives recall and precision of 33.3%.
• The next relevant doc (d129, rank 8) gives us 66.7% recall at 25% precision.
• The next (d3, rank 15) gives us 100% recall with 20% precision.
• How do we figure out the precision at the 11 standard recall levels?

Page 20: Interpolation

$$P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$$

where $r_j$, $j = 0, 1, 2, \ldots, 10$, is a reference to the $j$-th standard recall level. I.e., the interpolated Precision at the $j$-th level is the maximum known Precision at any recall level between the $j$-th and the $(j+1)$-th.
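A Python sketch of interpolation; it uses the common ceiling rule (interpolated precision at a standard level is the maximum precision observed at any recall at or above that level, a slight widening of the window in the formula above), which reproduces the numbers on the next page:

    def interpolate_11pt(points):
        """points: (recall, precision) pairs observed for one query.
        Returns interpolated precision at recall 0.0, 0.1, ..., 1.0."""
        interp = []
        for j in range(11):
            r_j = j / 10
            candidates = [p for r, p in points if r >= r_j - 1e-9]
            interp.append(max(candidates) if candidates else 0.0)
        return interp

    # The Page 19 example: relevant docs found at ranks 3, 8, and 15
    points = [(1/3, 1/3), (2/3, 2/8), (3/3, 3/15)]
    print([round(p, 3) for p in interpolate_11pt(points)])
    # [0.333, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]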

Page 21: Interpolation

• So, at recall levels 0%, 10%, 20%, and 30% the interpolated precision is 33.3%

• At recall levels 40%, 50%, and 60% interpolated precision is 25%

• And at recall levels 70%, 80%, 90% and 100%, interpolated precision is 20%

• Giving the graph…

Page 22: Interpolated Precision/Recall

[Figure: the interpolated Precision (%) vs. Recall (%) graph, both axes from 0 to 100.]

Page 23: Document Cutoff Levels

• Another way to evaluate:
  – Fix the number of documents retrieved at several levels:
    • top 5
    • top 10
    • top 20
    • top 50
    • top 100
    • top 500
  – Measure precision at each of these levels
  – Take (weighted) average over results
• This is a way to focus on how well the system ranks the first k documents.
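A Python sketch of precision at a document cutoff, reusing the ranked list and relevant set Rq from the Page 15 example (names illustrative):

    def precision_at_k(ranking, relevant, k):
        """Precision over the top-k retrieved documents."""
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
               "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
    for k in (5, 10, 15):
        print(f"P@{k} = {precision_at_k(ranking, Rq, k):.2f}")
    # P@5 = 0.40, P@10 = 0.40, P@15 = 0.33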

Page 24: Problems with Precision/Recall

• Can’t know true recall value
  – except in small collections
• Precision/Recall are related
  – A combined measure sometimes more appropriate
• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
• Assumes a strict rank ordering matters.

Page 25: Precision/Recall Curves

• Difficult to determine which of these two hypothetical results is better:

[Figure: two hypothetical precision/recall curves plotted on the same axes, each marked with x points.]

Page 26: The E-Measure

Combine Precision and Recall into one number (van Rijsbergen, 1979):

$$E = 1 - \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}$$

P = precision
R = recall
b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.

Equivalently, with $\alpha = 1/(b^2 + 1)$:

$$E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
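A Python sketch of the E-measure (0 is best, 1 is worst); the asymmetric example shows how the choice of b shifts the penalty:

    def e_measure(P, R, b=1.0):
        """van Rijsbergen's E-measure; b < 1 favors precision, b > 1 recall."""
        if P == 0 or R == 0:
            return 1.0
        return 1 - (1 + b**2) / (b**2 / R + 1 / P)

    print(e_measure(0.5, 0.5))           # 0.5 (at b = 1, E = 1 - F)
    print(e_measure(0.8, 0.2, b=0.5))    # 0.50: precision-oriented user
    print(e_measure(0.8, 0.2, b=2.0))    # 0.76: recall-oriented user penalizes the low recall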

Page 27: User-Oriented Measures

Coverage = |RetrievedKnown| / |AllKnown|
  – the fraction of the relevant documents already known to the user that the system retrieved

Novelty = |RetrievedUnknown| / |AllRetrieved|
  – the fraction of the relevant documents retrieved that were previously unknown to the user
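A Python sketch of both measures over sets of relevant document IDs (the set names are illustrative):

    def coverage(relevant_retrieved, known_relevant):
        """Fraction of the user's known relevant docs that were retrieved."""
        return len(relevant_retrieved & known_relevant) / len(known_relevant)

    def novelty(relevant_retrieved, known_relevant):
        """Fraction of retrieved relevant docs previously unknown to the user."""
        return len(relevant_retrieved - known_relevant) / len(relevant_retrieved)

    relevant_retrieved = {"d1", "d2", "d3", "d4"}   # relevant docs the system found
    known_relevant = {"d1", "d2", "d5"}             # relevant docs the user knew of
    print(coverage(relevant_retrieved, known_relevant))   # 2/3 ~ 0.67
    print(novelty(relevant_retrieved, known_relevant))    # 2/4 = 0.5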

Page 28: Reference Collections

• Need consistent benchmarks for evaluation
• TREC - Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – Held each year since 1992
  – Queries + Relevance Judgments
    • Queries devised and judged by “Information Specialists”
    • Relevance judgments done only for those documents retrieved -- not the entire collection!
    • Results judged on precision and recall, going up to a recall level of 1000 documents

Page 32: Sample TREC Queries (Topics)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)

<narr> Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

Page 34: TREC

• Benefits:
  – made research systems scale to large collections (pre-WWW)
  – allows for somewhat controlled comparisons
• Drawbacks:
  – emphasis on high recall, which may be unrealistic for what most users want
  – very long queries, also unrealistic
  – comparisons still difficult to make, because systems are quite different on many dimensions
  – focus on batch ranking rather than interaction
    • There is an interactive track.

Page 35: Some Observations

• Relatively low recall is sometimes acceptable
  – In one study, the system retrieved less than 20% of the relevant documents for a particular information need; users thought they had 75%, and were satisfied
• Users can’t foresee the exact words and phrases that will indicate relevant documents
  – “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” …
  – differing technical terminology
  – slang, misspellings