Special Topics on Information Retrieval


Transcript of Special Topics on Information Retrieval

Page 1: Special Topics on Information Retrieval

Special Topics on Information Retrieval

Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/

[email protected]

University of Alabama at Birmingham, Fall 2010.

Page 2: Special Topics on Information Retrieval

Introduction

Page 3: Special Topics on Information Retrieval


Content of the section

• Definition of the task
• The vector space model
• Performance evaluation
• Main problems and basic solutions
  – Query expansion
  – Relevance feedback
  – Clustering (documents or results)

Page 4: Special Topics on Information Retrieval


Initial questions

• What is an information retrieval system?
• What is its goal?
• What is inside it? Any sub-processes?
• How can we evaluate its performance?
• Why are the results not always relevant?

Page 5: Special Topics on Information Retrieval


First definition

“Information retrieval (IR) embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, techniques, or machines are employed to carry out the operation”

Calvin Mooers, 1951

Page 6: Special Topics on Information Retrieval


General scheme of the IR process

[Diagram: general scheme of the IR process. A user's task gives rise to an information need (conception); the need is formulated as a query; the query is searched against the corpus, producing results; the results may be refined, feeding back into the query.]

Page 7: Special Topics on Information Retrieval


More definitions

• IR deals with the representation, storage, organization of, and access to information items.

R. Baeza-Yates and B. Ribeiro-Neto, 1999

• The task of an IR system is to retrieve documents or texts with information content that is relevant to a user’s information need.

Sparck Jones & Willett, 1997

Page 8: Special Topics on Information Retrieval


Typical IR system

[Diagram: a typical IR system. The document collection undergoes preprocessing and storing to build the index; a query is processed by querying and retrieving against that index, producing the results. Both indexing and retrieval are governed by the IR model.]

Page 9: Special Topics on Information Retrieval


Vector space model

• Documents are represented as vectors in an N-dimensional space
  – N is the number of terms in the collection
  – A term is not the same as a word
• The query is treated like any other document
• Relevance is measured by similarity:
  – A document is relevant to the query if its vector is similar to the query's vector.

Page 10: Special Topics on Information Retrieval


Preprocessing

• Eliminate information about style, such as HTML or XML tags.
  – For some applications this information may be useful; for instance, to index only some document sections.
• Remove stop words.
  – Functional words such as articles, prepositions and conjunctions are not useful (they have no meaning of their own).
• Perform stemming or lemmatization (sketched below).
  – The goal is to reduce inflectional forms, and sometimes derivationally related forms:
    am, are, is → be
    car, cars, car's → car
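To make these steps concrete, here is a minimal preprocessing sketch in Python; the stopword list and the suffix rules are illustrative stand-ins for a real stopword list and a real stemmer (e.g., Porter's):

```python
import re

# Toy stopword list; a real system would use a full list.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to"}

def strip_markup(text):
    """Remove HTML/XML tags (style information)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def naive_stem(token):
    """Toy stemmer: collapse a few inflectional endings."""
    for suffix in ("'s", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(strip_markup(text))
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The cars and the car's engine</p>"))
# ['car', 'car', 'engine']
```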

Page 11: Special Topics on Information Retrieval


Representation

All documents are represented together as a term-document matrix, one vector per document:

        t1    t2    …    tn
  d1
  d2          wi,j
  :
  dm

• The columns span the whole vocabulary of the collection (all distinct terms).
• wi,j is a weight indicating the contribution of term tj in document di.
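A sketch of how such a matrix can be built from already-preprocessed documents; here the weights are raw term counts (weighting schemes are discussed next), and the toy documents are made up for illustration:

```python
from collections import Counter

docs = [["car", "insurance", "car"],
        ["car", "repair"],
        ["insurance", "claim"]]

vocabulary = sorted({t for d in docs for t in d})   # all distinct terms
matrix = [[Counter(d)[t] for t in vocabulary] for d in docs]

print(vocabulary)       # ['car', 'claim', 'insurance', 'repair']
for row in matrix:      # one vector per document
    print(row)
```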

Page 12: Special Topics on Information Retrieval


Term weighting - two main ideas

• The importance of a term increases proportionally with the number of times it appears in the document.
  – It helps to describe the document's content.
• The general importance of a term decreases proportionally with its occurrences in the entire collection.
  – Common terms are not good for distinguishing relevant from non-relevant documents.

Page 13: Special Topics on Information Retrieval


Term weighting – main approaches

• Binary weights:
  – wi,j = 1 iff document di contains term tj, otherwise 0.
• Term frequency (tf):
  – wi,j = (no. of occurrences of tj in di)
• tf × idf weighting scheme (sketched below):
  – wi,j = tf(tj, di) × idf(tj), where:
    • tf(tj, di) indicates the occurrences of tj in document di
    • idf(tj) = log [N/df(tj)], where df(tj) is the number of documents that contain the term tj.
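A minimal sketch of the tf × idf scheme as defined above; the logarithm base is not specified on the slide, so base 10 is assumed here, and the toy collection is made up:

```python
import math
from collections import Counter

docs = [["car", "insurance", "car"],
        ["car", "repair"],
        ["insurance", "claim", "insurance"]]

N = len(docs)
df = Counter()                      # df(t): number of documents containing t
for d in docs:
    df.update(set(d))

def tfidf(doc):
    tf = Counter(doc)               # tf(t, d): occurrences of t in d
    return {t: tf[t] * math.log10(N / df[t]) for t in tf}

for i, d in enumerate(docs):
    print(i, tfidf(d))
```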

Page 14: Special Topics on Information Retrieval


Similarity measure

• Relevance is the similarity between the documents' vectors and the query's vector.
• It is measured by means of the cosine measure (a code sketch follows the formula below).
  – The closer the vectors (the smaller their angle), the greater the document similarity.

$$sim(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_i w_{q,i}\, w_{d,i}}{\sqrt{\sum_i w_{q,i}^2}\; \sqrt{\sum_i w_{d,i}^2}}$$

[Diagram: the query vector q and two document vectors d1, d2 in a two-dimensional term space; the angles a1 and a2 between q and each document determine their similarity.]
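A direct transcription of the cosine formula into Python, with each vector stored as a {term: weight} dictionary; the example weights are made up:

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = {"car": 0.7, "insurance": 0.2}
d = {"car": 0.5, "repair": 0.6}
print(round(cosine(q, d), 3))   # 0.616
```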

Page 15: Special Topics on Information Retrieval


Vector space model − Pros & Cons

• Pros
  – Easy to explain
  – Mathematically sound
  – Allows approximate query matching
• Cons
  – Needs term weighting
  – Hard to model structured queries
  – Normalization increases computational costs
• It is the most commonly used IR model; it is often considered superior to others due to its simplicity and elegance.

Page 16: Special Topics on Information Retrieval


Other IR models

• Boolean model (±1950)
• Document similarity (±1957)
• Probabilistic indexing (±1960)
• Vector space model (±1970)
• Probabilistic retrieval (±1976)
• Fuzzy set models (±1980)
• Inference networks (±1992)
• Language models (±1998)

Page 17: Special Topics on Information Retrieval


IR evaluation

• Why is evaluation important?
• Which characteristics do we need to evaluate?
• How can we evaluate the performance of IR systems?
  – Given several systems, which one is the best?
• What things (resources) are necessary to evaluate an IR system?
• Is IR evaluation subjective or objective?

Page 18: Special Topics on Information Retrieval


Several perspectives

• In order to answer "How well does the system work?", we can investigate several options:
  – Processing: time and space efficiency
  – Search: effectiveness of results
  – System: satisfaction of the user
• We will focus on evaluating retrieval effectiveness.
  – How can we measure the other aspects?

Page 19: Special Topics on Information Retrieval


Difficulties in evaluating an IR system

• Effectiveness is related to the relevancy of retrieved items.
  – Relevancy is typically not binary but continuous.
  – Even if relevancy is binary, it can be a difficult judgment to make.
• Relevancy, from a human standpoint, is:
  – Subjective: depends upon a specific user's judgment.
  – Situational: relates to the user's current needs.
  – Cognitive: depends on human perception and behavior.
  – Dynamic: changes over time.

Page 20: Special Topics on Information Retrieval


Main requirements

• It is necessary to have a test collection:
  – A lot of documents (the bigger the better)
  – Several queries
  – Relevance judgments for all queries
    • A binary assessment of either relevant or non-relevant for each query-document pair.
• Methods/systems must be evaluated using the same evaluation measure.

Constructing a test collection requires considerable human effort.

Page 21: Special Topics on Information Retrieval


Standard test collections

• TREC (Text Retrieval Conference)
  – Organized by the National Institute of Standards and Technology
  – In total, 1.89 million documents and relevance judgments for 450 information needs
• CLEF (Cross Language Evaluation Forum)
  – This evaluation series has concentrated on European languages and cross-language information retrieval.
  – Last Ad hoc English monolingual IR task: 169,477 documents and 50 queries.

Page 22: Special Topics on Information Retrieval


Retrieval effectiveness

In response to a query, an IR system searches a document collection and returns an ordered list of responses.

• We measure the quality of a set/list of responses:
  – A better search strategy yields a better result list.
  – Better result lists help the user fill their information need.
• Two kinds of measures: set-based and ranked-list-based.

Page 23: Special Topics on Information Retrieval


Set based measures

[Diagram: a Venn diagram over the collection showing, for a query, the set of relevant documents and the set of retrieved documents (the results); their intersection contains the relevant retrieved documents.]

$$\text{recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}}$$

$$\text{precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}}$$

Page 24: Special Topics on Information Retrieval


Precision, recall and F-measure

• Precision (P)
  – The ability to retrieve top-ranked documents that are mostly relevant.
• Recall (R)
  – The ability of the search to find all of the relevant items in the corpus.
• F-measure (F)
  – The harmonic mean of recall and precision:

$$F = \frac{2PR}{P + R}$$
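The three set-based measures, computed directly from the definitions above; the document identifiers are made up:

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)        # relevant retrieved documents
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 3 of the 5 retrieved documents are relevant; 4 relevant documents exist.
print(precision_recall_f({"d1", "d2", "d3", "d4", "d5"},
                         {"d1", "d3", "d5", "d9"}))
# (0.6, 0.75, 0.666...)
```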

Page 25: Special Topics on Information Retrieval


Ranked-list based measures

• Average Recall/Precision Curve– Plots average precision at each standard recall

level across all queries.• MAP (mean average precision)– Provides a single-figure measure of quality across

recall levels• R-prec– Precision at the R-th position in the ranking of

results for a query that has R relevant documents

Page 26: Special Topics on Information Retrieval


Recall/Precision Curve (from Mooney's IR course at the University of Texas at Austin)

[Figure: average precision plotted against recall at the standard levels 0.1 to 1.0, comparing a system with stemming (Stem) against one without (NoStem).]

What is the curve of an ideal system?

Page 27: Special Topics on Information Retrieval


MAP

• Average precision (AP) is the average of the precision scores at the rank positions of each relevant document:

$$AP = \frac{1}{|\text{relevant documents}|} \sum_{i=1}^{N} P(i) \times rel(i)$$

where N is the number of retrieved documents, P(i) is the precision of the first i documents, and rel(i) is a binary function indicating whether the document at position i is relevant or not.

• Mean Average Precision (MAP) is the mean of the average precision scores for a group of queries.
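A direct implementation of the formula above; the rankings and relevance sets are made up for illustration:

```python
def average_precision(ranking, relevant):
    """ranking: doc ids in rank order; relevant: set of relevant ids."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:          # rel(i) = 1
            hits += 1
            score += hits / i        # P(i) at this rank position
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d2", "d1", "d4"], {"d1"})]
print(average_precision(*runs[0]))    # (1/1 + 2/3) / 2 = 0.833...
print(mean_average_precision(runs))   # (0.833... + 0.5) / 2 = 0.666...
```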

Page 28: Special Topics on Information Retrieval


Illustrative example (from the IR course of Northeastern University, College of Computer and Information Science)

For the two example ranked lists of the original slide (not reproduced here): MAP1 = 0.622 and MAP2 = 0.52.

Page 29: Special Topics on Information Retrieval


Common problems

• Why are not all retrieved documents relevant? Why is it so difficult to reach 100% precision?
  – Consider the ambiguous query "jaguar" (the animal, or the car?)
• Why is it complex to retrieve all the relevant documents (to reach 100% recall)?
  – Consider the query "religion"
• What can we do to tackle these problems?

Page 30: Special Topics on Information Retrieval


Query expansion

• It is the process of adding terms to a user's (weighted) query.
• Its goal is to improve precision and/or recall.
• Example:
  – User query: "car"
  – Expanded query: "car cars automobile automobiles auto", etc.
• How to do it? Ideas?

Page 31: Special Topics on Information Retrieval


Main approaches

1. By means of a thesaurus
   – Thesauri may be manually or automatically constructed.
2. By means of (user) relevance feedback
3. Automatic query expansion
   – Local query expansion (blind feedback)
   – Global query expansion (using word associations)

Page 32: Special Topics on Information Retrieval


Thesaurus-based query expansion

A thesaurus provides information on synonyms and semantically related words.

• Expansion procedure: for each term t in a query, expand the query with the synonyms and related words of t (a sketch follows below).
• Generally increases recall.
• May significantly decrease precision, particularly with ambiguous terms.
  – "interest rate" → "interest rate fascinate evaluate"
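A thesaurus-based expansion sketch using WordNet through NLTK; this assumes nltk is installed and the wordnet corpus has been downloaded (nltk.download('wordnet')), and the helper names are illustrative, not a standard API:

```python
from nltk.corpus import wordnet as wn

def expand_term(term, max_synonyms=5):
    """Collect synonyms of a term across all its WordNet senses."""
    synonyms = []
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():
            word = lemma.replace("_", " ").lower()
            if word != term and word not in synonyms:
                synonyms.append(word)
    return [term] + synonyms[:max_synonyms]

def expand_query(query):
    return [t for term in query.split() for t in expand_term(term)]

print(expand_query("car"))
# e.g. ['car', 'auto', 'automobile', 'machine', 'motorcar', 'railcar']
```

Note that the expansion mixes all senses of each term, which is exactly how an ambiguous query like "interest" can pick up unrelated words such as "fascinate".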

Page 33: Special Topics on Information Retrieval


Relevance feedback

• Basic procedure:
  1. The user creates their initial query, which returns an initial result set.
  2. The user selects a list of documents that are relevant to their search.
  3. The system then re-weights and/or expands the query based upon the terms in those documents.
• This yields a significant improvement in recall and precision over early query expansion work.

Page 34: Special Topics on Information Retrieval


Standard Rocchio Method

• The idea is to move the query in a direction closer to the relevant documents, and farther away from the irrelevant ones.

$$\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

α: tunable weight for the initial query
β: tunable weight for the relevant documents (Dr)
γ: tunable weight for the irrelevant documents (Dn)
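A minimal Rocchio sketch over dense term vectors using numpy; the α, β, γ defaults shown are common textbook choices, not values from the slides:

```python
import numpy as np

def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant centroids, away from irrelevant ones."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        q = q - gamma * np.mean(irrelevant, axis=0)
    return np.maximum(q, 0.0)      # negative weights are usually clipped to 0

query = np.array([1.0, 0.0, 0.0])
relevant = np.array([[0.9, 0.8, 0.0], [0.7, 0.6, 0.0]])
irrelevant = np.array([[0.0, 0.0, 1.0]])
print(rocchio(query, relevant, irrelevant))   # [1.6, 0.525, 0.0]
```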

Page 35: Special Topics on Information Retrieval


Pseudo relevance feedback

Users do not like to give manual feedback to the system.

• Use relevance feedback methods without explicit user input.
• Simply assume the top m retrieved documents are relevant, and use them to reformulate the query (a sketch follows below).
• Relies largely on the system's ability to initially retrieve relevant documents.
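A pseudo relevance feedback sketch: rank once, take the top m documents as if they were relevant, and move the query toward their centroid (Rocchio with no irrelevant set). The rank function and all vectors are illustrative:

```python
import numpy as np

def rank(query_vec, doc_matrix):
    """Score documents by dot product and return indices, best first."""
    return np.argsort(-(doc_matrix @ query_vec))

def pseudo_feedback(query_vec, doc_matrix, m=2, alpha=1.0, beta=0.75):
    top_m = doc_matrix[rank(query_vec, doc_matrix)[:m]]   # assumed relevant
    new_query = alpha * query_vec + beta * top_m.mean(axis=0)
    return rank(new_query, doc_matrix)

docs = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.3, 0.0],
                 [0.0, 0.2, 0.9]])
print(pseudo_feedback(np.array([1.0, 0.0, 0.0]), docs))
```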

Page 36: Special Topics on Information Retrieval


Automatic global analysis

• Determine term similarity through a pre-computed statistical analysis of the complete corpus.
  – Compute association matrices which quantify term correlations in terms of how frequently they co-occur.
• Expand queries with the statistically most similar terms.
  – The same information is used for all queries; it is an offline process.
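One way to pre-compute such an association matrix: with a binary term-document incidence matrix A, the matrix C = AᵀA counts, for every pair of terms, the number of documents in which they co-occur. A small made-up example:

```python
import numpy as np

terms = ["car", "auto", "insurance", "claim"]
A = np.array([[1, 1, 0, 0],     # doc 1 contains car, auto
              [1, 0, 1, 0],     # doc 2 contains car, insurance
              [0, 0, 1, 1]])    # doc 3 contains insurance, claim

C = A.T @ A                     # C[i, j] = no. of docs where ti, tj co-occur

def most_similar(term, k=2):
    """Terms most strongly associated with the given term."""
    i = terms.index(term)
    row = C[i].astype(float)
    row[i] = -1                 # exclude the term itself
    return [terms[j] for j in np.argsort(-row)[:k]]

print(most_similar("car"))      # ['auto', 'insurance']
```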

Page 37: Special Topics on Information Retrieval


Clustering in information retrieval

• Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
  – If a document from a cluster is relevant to a search request, then it is likely that other documents from the same cluster are also relevant.
• Two main uses:
  – Collection clustering
    • Higher efficiency: faster search
    • Tends to improve recall
  – Search results clustering
    • More effective presentation of information to the user (sketched below)
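A search-results clustering sketch using scikit-learn (assumed installed): tf-idf vectors plus k-means, grouping a small made-up result list for the ambiguous query "jaguar":

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

results = [
    "jaguar car dealership and auto sales",
    "new jaguar sports car review",
    "jaguar habitat in the south american jungle",
    "jaguar cats hunting behavior",
]

# Vectorize the result snippets and cluster them into two groups.
X = TfidfVectorizer(stop_words="english").fit_transform(results)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, text in sorted(zip(labels, results)):
    print(label, text)
```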