Information Retrieval Models

Information Retrieval Models School of Informatics Dept. of Library and Information Studies Dr. Miguel E. Ruiz

Transcript of Information Retrieval Models

Page 1: Information Retrieval Models

Information Retrieval Models

School of Informatics, Dept. of Library and Information Studies, Dr. Miguel E. Ruiz

Page 2: Information Retrieval Models

What is Information Retrieval?

Information Retrieval deals with information items in terms of representation, storage, organization, and access.

An IR system should provide access to the information that the user is interested in.

Page 3: Information Retrieval Models

Example of an Information Need

“Find all documents containing information on college tennis teams which: (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. To be relevant, the document must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.”

Page 4: Information Retrieval Models

The user must translate his/her information needs into a query.

Most commonly a query is a set of keywords that summarizes the information needs.

Page 5: Information Retrieval Models

Information Retrieval vs. Data Retrieval

Data retrieval consists of determining which documents of the collection contain the keywords in the user query.

Information retrieval should “interpret” the contents of the documents in the collection and retrieve all the documents that are relevant to the user query while retrieving as few non-relevant documents as possible.

Page 6: Information Retrieval Models

Basic Concepts

Effective retrieval of relevant information is affected by:

• the user task
• the logical view of the documents

Page 7: Information Retrieval Models

The User Task

[Figure: the user carries out one of two tasks against the database: Retrieval or Browsing.]

Page 8: Information Retrieval Models

Logical View of a Document

Documents can be represented as:

• A set of keywords or indexing terms
• Full text

Page 9: Information Retrieval Models

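Logical View of the Document

[Figure: starting from the document (text + structure), a chain of text operations (structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing) yields logical views ranging from full text and structure to index terms.]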

Page 10: Information Retrieval Models

The Retrieval Process

[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flows between them: user's need, text, logical view, query, retrieved docs, ranked docs, user's feedback.]

Page 11: Information Retrieval Models

IR Models

An IR model is a quadruple [D, Q, F, R(qi,dj)]:

• D: set of logical representations of the documents
• Q: set of logical representations of the queries
• F: framework for modeling document representations, queries, and their relationships
• R(qi,dj): ranking function that defines an association between the query and the documents. This ranking defines an ordering among the documents with regard to the query.

Page 12: Information Retrieval Models

[Figure: taxonomy of IR models by user task.]

User tasks: Retrieval (ad hoc, filtering) and Browsing

Retrieval models:

• Classic Models: Boolean, Vector space, Probabilistic
• Structured Models: Non-Overlapping Lists, Proximal Nodes
• Set Theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network

Browsing models:

• Flat
• Structure Guided
• Hypertext

Page 13: Information Retrieval Models

Retrieval Models and Logical View of Documents

Retrieval:

• Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic
• Full Text: Classic, Set Theoretic, Algebraic, Probabilistic
• Full Text + Structure: Structured

Browsing:

• Index Terms: Flat
• Full Text: Flat, Hypertext
• Full Text + Structure: Structure Guided, Hypertext

Page 14: Information Retrieval Models

IR Models Basic concepts:

Each document is described as a set of representative keywords called index terms.

An index term is a word (which can be in the document) that helps in remembering the document’s main themes.

Index terms are used to index and summarize the document contents

Page 15: Information Retrieval Models

IR Models Basic concepts (cont.)

Index terms have varying relevance when used to describe the document contents. This effect is captured by assigning numerical weights to each index term in the document.

A weight is a positive value associated with each index term in the document.

Page 16: Information Retrieval Models

IR Models

The Boolean Model is a simple retrieval model based on set theory and Boolean algebra.

Documents are represented by the index terms assigned to the document. There is no indication of which terms are more important than others (weights are binary, either 0 or 1).

Page 17: Information Retrieval Models

IR Models Boolean Model (cont.)

The Boolean operators used are:

• Conjunction (AND, ∧)
• Disjunction (OR, ∨)
• Negation (NOT, ¬)

Queries are specified as conventional Boolean expressions, which can be represented as a disjunction of conjunctive vectors (disjunctive normal form, DNF).

Page 18: Information Retrieval Models

IR Models Boolean model

Examples:

Q = Safety ∧ (Car ∨ ¬Industry)

Qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

Each vector gives the binary weights of the terms (Safety, Car, Industry), in that order.
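The matching rule behind this example can be sketched in Python. This is a minimal illustration, assuming the three vector components correspond to the terms Safety, Car and Industry in that order; the names `TERMS`, `Q_DNF`, `binary_vector` and `matches` are hypothetical and not part of the slides.

```python
# Assumed term order for the vector components (not stated on the slide).
TERMS = ("safety", "car", "industry")

# The query in disjunctive normal form: one binary vector per conjunct.
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}


def binary_vector(doc_terms):
    """Return the 0/1 weight of each query term for one document."""
    return tuple(int(term in doc_terms) for term in TERMS)


def matches(doc_terms):
    """A document is retrieved iff its vector equals one of the conjuncts."""
    return binary_vector(doc_terms) in Q_DNF


if __name__ == "__main__":
    docs = {
        "d1": {"safety", "car"},
        "d2": {"safety", "industry"},
        "d3": {"safety"},
    }
    for name, terms in docs.items():
        print(name, binary_vector(terms), matches(terms))
```

The result is a plain yes/no decision per document, which is why the Boolean model cannot rank its answers.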

Page 19: Information Retrieval Models

IR Models Boolean Model

Advantages:

• Clean formalism
• Easy to implement

Disadvantages:

• Exact matching may retrieve too few or too many documents.
• Expressing an information need as a Boolean expression might be challenging.

Page 20: Information Retrieval Models

IR Models Vector Space Model

Documents and queries are expressed using a vector whose components are all the possible index terms (t). Each index term has an associated weight that indicates the importance of the index term in the document (or query).

\[ \vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) \]

\[ \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q}) \]

Page 21: Information Retrieval Models

IR Models

In other words, the document dj and the query q are represented as t-dimensional vectors.

[Figure: the vectors dj and q.]

Page 22: Information Retrieval Models

IR Model

The vector space model proposes to evaluate the degree of similarity of document dj with regard to the query q as the correlation between the two vectors dj and q.

Page 23: Information Retrieval Models

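IR Models

This correlation can be quantified in different ways, for example by the cosine of the angle between these two vectors.

\[ sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}} \]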
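The cosine measure described on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption, since the slides do not cover that case.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between a document vector d and a query vector q.

    Both arguments are equal-length sequences of term weights.
    """
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        # Assumption: an empty vector is treated as having no similarity.
        return 0.0
    return dot / (norm_d * norm_q)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the norms divide out vector length, a long document and a short one with the same term proportions get the same score.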

Page 24: Information Retrieval Models

IR Models

Since wi,j ≥ 0 and wi,q ≥ 0, sim(dj,q) varies from 0 to +1. The vector space model assumes that the similarity value is an indication of the relevance of the document to the given query. Thus the vector space model ranks the retrieved documents by their similarity value.

Page 25: Information Retrieval Models

IR Models

How can we compute the values of the weights wi,j?

One of the most popular methods is based on combining two factors:

• The importance of each index term in the document
• The importance of the index term in the collection

Page 26: Information Retrieval Models

IR Models

Importance of the index term in the document:

This can be measured by the number of times that the term appears in the document. The more often the term is mentioned in the document, the better it describes the document's content. This count is called the term frequency, denoted by the symbol tf.

Page 27: Information Retrieval Models

IR Models

The importance of the index term in the collection:

An index term that appears in every document in the collection is not very useful, but a term that occurs in only a few documents may indicate that these few documents could be relevant to a query that uses this term.

Page 28: Information Retrieval Models

IR Model

In other words, the importance of an index term in the collection is quantified by the inverse of the frequency that this term has among the documents in the collection. This factor is usually called the inverse document frequency or the idf factor.

Page 29: Information Retrieval Models

IR Models

Mathematically this can be expressed as:

\[ idf_i = \log \frac{N}{n_i} \]

Where:

• N = number of documents in the collection
• n_i = number of documents that contain the term i

Page 30: Information Retrieval Models

IR Models

Combining these two factors we can obtain the weight of an index term i in document j as:

\[ w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \log \frac{N}{n_i} \]

This is also called the tf-idf weighting scheme.
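As a rough illustration of the tf-idf scheme, the sketch below computes one weight per term and document for a toy collection. It assumes raw counts for tf and the natural logarithm for idf (the slides do not fix the base of the logarithm); the function name `tf_idf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. tf is the raw count of the term in the
    document; idf is log(N / n_i), using the natural log as an assumption.
    """
    n_docs = len(docs)
    # n_i: number of documents that contain term i (each document counted once).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in tf.items()}
        )
    return weights


if __name__ == "__main__":
    collection = [
        ["tennis", "team", "tennis"],
        ["team", "coach"],
        ["team", "ranking"],
    ]
    for doc_weights in tf_idf_weights(collection):
        print(doc_weights)
```

A term that occurs in every document, such as "team" here, gets weight 0 because log(N / N) = 0, which matches the observation that such terms do not help discriminate between documents.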

Page 31: Information Retrieval Models

IR Models Vector Space Model

Advantages:

• Its term weighting scheme can improve retrieval performance.
• Allows partial matching.
• Retrieved documents are sorted according to their degree of similarity.

Disadvantages:

• Terms are assumed to be mutually independent. In some cases this might hurt performance.

Page 32: Information Retrieval Models

IR Models Probabilistic model

Assumption: given a query q and a document dj in the collection, this model tries to estimate the probability that the user will find the document dj relevant.

The model assumes that there exists an ideal subset R of the document collection that contains only relevant documents.

Page 33: Information Retrieval Models

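IR Models

Similarity in the probabilistic model is computed as:

\[ sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) \]

\[ w_{i,j} \in \{0, 1\}, \qquad w_{i,q} \in \{0, 1\} \]

Where: P(k_i|R) is the probability that index term k_i is present in a document randomly selected from the set R, and P(k_i|\bar{R}) is the same probability for the set \bar{R} of non-relevant documents.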
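The sum above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: weights are binary, so only terms present in both the query and the document contribute; the probability estimates are passed in as dictionaries and must lie strictly between 0 and 1. The names `probabilistic_score`, `p_rel` and `p_nonrel` are hypothetical.

```python
import math


def probabilistic_score(doc_terms, query_terms, p_rel, p_nonrel):
    """Score a document against a query with binary term weights.

    p_rel[k]    : estimate of P(k | R), term k occurs in a relevant document
    p_nonrel[k] : estimate of P(k | not R), term k occurs in a non-relevant one
    """
    score = 0.0
    # w_iq * w_ij is 1 only for terms shared by the query and the document.
    for term in query_terms & doc_terms:
        p, u = p_rel[term], p_nonrel[term]
        score += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return score


if __name__ == "__main__":
    p_rel = {"tennis": 0.5, "ncaa": 0.5}
    p_nonrel = {"tennis": 0.1, "ncaa": 0.5}
    query = {"tennis", "ncaa"}
    print(probabilistic_score({"tennis", "coach"}, query, p_rel, p_nonrel))
    print(probabilistic_score({"ncaa"}, query, p_rel, p_nonrel))
```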

Page 34: Information Retrieval Models

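IR Models

Estimation of P(k_i|R) and P(k_i|\bar{R}), where \bar{R} is the set of non-relevant documents:

Initial constant values:

\[ P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N} \]

Iterative process to improve estimates:

\[ P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1} \]

V is the set of documents retrieved.

V_i is the set of documents in V that contain k_i.

In the formulas, V and V_i stand for the number of documents in each set.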
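A small sketch of the two estimation steps on this slide, written as plain Python functions. It assumes the reading given in the comments: a constant start of 0.5 and n_i / N, then re-estimation from the sizes of the retrieved set V and its subset V_i, with the 0.5 and 1 adjustment terms. The function and parameter names are hypothetical.

```python
def initial_estimates(n_i, n_docs):
    """Starting values: P(k_i | R) = 0.5 and P(k_i | not R) = n_i / N."""
    return 0.5, n_i / n_docs


def updated_estimates(v_i, v, n_i, n_docs):
    """Re-estimate both probabilities from one round of retrieval.

    v   : number of documents retrieved (size of V)
    v_i : number of retrieved documents that contain term k_i (size of V_i)
    """
    p_rel = (v_i + 0.5) / (v + 1)
    p_nonrel = (n_i - v_i + 0.5) / (n_docs - v + 1)
    return p_rel, p_nonrel


if __name__ == "__main__":
    # Term k_i occurs in 10 of 100 documents; 9 were retrieved, 4 contain k_i.
    print(initial_estimates(n_i=10, n_docs=100))
    print(updated_estimates(v_i=4, v=9, n_i=10, n_docs=100))
```

In an iterative run, the updated values would be fed back into the ranking, a new set V retrieved, and the estimates recomputed until the ranking stabilizes.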

Page 35: Information Retrieval Models

IR Models Probabilistic Model

Advantages:

• Retrieved documents are ranked according to their probability of being relevant.

Disadvantages:

• The need to get the initial separation of docs in R and non-R.
• The method does not take into account the frequency of index terms.
• Independence assumption.

Page 36: Information Retrieval Models

[Figure repeated from Page 12: taxonomy of IR models by user task (retrieval and browsing).]

Page 37: Information Retrieval Models

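Retrieval Evaluation

[Figure: within the collection, the set of relevant documents R and the answer set A overlap; the overlap Ra is the set of relevant documents in the answer set.]

\[ \text{Recall} = \frac{|R_a|}{|R|} \]

\[ \text{Precision} = \frac{|R_a|}{|A|} \]

Where |R| is the number of relevant documents, |A| is the size of the answer set, and |Ra| is the number of relevant documents in the answer set.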
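Both measures reduce to set operations, as the following Python sketch shows. The function name `precision_recall` is hypothetical, and returning 0.0 when a set is empty is an assumption made to avoid division by zero.

```python
def precision_recall(relevant, answer):
    """Return (precision, recall) for a set of relevant docs and an answer set."""
    relevant_in_answer = relevant & answer  # Ra
    precision = len(relevant_in_answer) / len(answer) if answer else 0.0
    recall = len(relevant_in_answer) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = {"d1", "d2", "d3", "d4"}
    answer = {"d3", "d4", "d5"}
    print(precision_recall(relevant, answer))
```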

Page 38: Information Retrieval Models

Retrieval Evaluation: Pooling

[Figure: pooling. The top K documents from each run (System 1, System 2, System 3, …) are combined into a pool of results, which is then passed on to user evaluation.]
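The pooling step can be sketched as taking the union of the top K documents from each run. In this Python sketch the function name `build_pool` is hypothetical, and keeping the pool in first-seen order is an arbitrary choice; the slide only says that the top K documents from each run are combined.

```python
def build_pool(runs, k):
    """Combine the top-k documents of every run into one duplicate-free pool."""
    pool = []
    seen = set()
    for run in runs:
        for doc in run[:k]:
            if doc not in seen:
                seen.add(doc)
                pool.append(doc)
    return pool


if __name__ == "__main__":
    runs = [
        ["a", "b", "c"],  # ranked output of system 1
        ["b", "d", "e"],  # ranked output of system 2
        ["f", "a", "g"],  # ranked output of system 3
    ]
    print(build_pool(runs, k=2))
```

Only the documents in the pool are shown to the human judges, which keeps the judging effort bounded regardless of collection size.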

Page 39: Information Retrieval Models

Retrieval Evaluation Relevance:

Strength: it involves people (users) as judges of effectiveness of performance.

Weakness: It involves judgments by people, with all the associated problems of subjectivity and variability.

Page 40: Information Retrieval Models

Retrieval Evaluation Types of relevance: (Saracevic, 1999)

System or algorithmic relevance: relation between the query and information objects retrieved.

Topical or Subject relevance: relation between the topic expressed in the query and the topic covered in the retrieved documents.

Cognitive relevance or pertinence: relation between the state of knowledge and cognitive information need of a user and documents retrieved.

Situational relevance or utility: relation between the task or problem at hand and the documents retrieved.

Motivational or affective relevance: relation between the intents, goals and motivations of a user and the documents retrieved.

Page 41: Information Retrieval Models

Retrieval Evaluation

T. Joachims. Evaluating retrieval performance using clickthrough data. In Text Mining: Theoretical Aspects and Applications. Physica-Verlag, 79-96, 2003.

Available online: http://www.cs.cornell.edu/People/tj/publications/joachims_02b.pdf

Page 42: Information Retrieval Models

Clickthrough for IR Performance

Main idea:

• Use a unified interface to submit queries to two search engines.
• Estimate relevance information from the links the user visits among the results returned by a search engine.

Page 43: Information Retrieval Models

Clickthrough for IR Performance Regular clickthrough data:

User types a query into a unified interface.

The query is sent to both search engines A & B.

Randomly select one list of results and present it to the user

Ranks of links clicked by user are recorded

Page 44: Information Retrieval Models

Clickthrough for IR Performance

Unbiased clickthrough data for comparing retrieval functions:

• Blind test: the interface should hide the random variables underlying the hypothesis test to avoid biasing the user response (placebo effect).
• Click preference: design the interface so that interaction with the system demonstrates a particular judgement by the user.
• Low usability impact

Page 45: Information Retrieval Models

Clickthrough for IR Performance

Input: ranking A = (a1, a2, …), ranking B = (b1, b2, …)
Call: combine(A, B, 0, 0, ∅)
Output: combined ranking D

combine(A, B, ka, kb, D) {
    if (ka = kb) {
        if (A[ka+1] ∉ D) { D := D + A[ka+1]; }
        combine(A, B, ka+1, kb, D);
    } else {
        if (B[kb+1] ∉ D) { D := D + B[kb+1]; }
        combine(A, B, ka, kb+1, D);
    }
}
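The recursive pseudocode on this slide has no explicit stopping condition. The Python sketch below is an iterative reading of it, with one added assumption: when one ranking is exhausted, the remaining items of the other are appended. Rankings are 0-indexed lists here, whereas the slide uses 1-based positions.

```python
def combine(a, b):
    """Interleave rankings a and b into one combined ranking without duplicates.

    Mirrors the slide's rule: when the counters ka and kb are equal, take the
    next item from A; otherwise take the next item from B.
    """
    combined = []
    seen = set()
    ka = kb = 0
    while ka < len(a) or kb < len(b):
        # Assumption: fall back to A once B is exhausted, and vice versa.
        if (ka == kb and ka < len(a)) or kb >= len(b):
            candidate = a[ka]
            ka += 1
        else:
            candidate = b[kb]
            kb += 1
        if candidate not in seen:
            seen.add(candidate)
            combined.append(candidate)
    return combined


if __name__ == "__main__":
    print(combine(["d1", "d2", "d3"], ["d2", "d4", "d1"]))
```

Alternating between the two rankings means that, at any cut-off in the combined list, roughly the same number of top links from each engine has been shown, which is what makes clicks comparable between the two.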

Page 46: Information Retrieval Models

Clickthrough for IR Performance Hypothesis test:

Use a binomial sign test (i.e., McNemar’s test) to detect significant deviations from the median
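A two-sided binomial sign test can be computed with the standard library alone. The sketch below assumes the data have already been reduced to two counts: the number of queries on which the user favoured engine A's links and the number on which B's links were favoured, with ties dropped. That reduction, and the names `sign_test_p_value`, `wins_a` and `wins_b`, are illustrative assumptions rather than details from the slides.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided p-value of the sign test under H0: A and B are equally likely to win."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a result at least as lopsided as the observed one,
    # counted in one tail and doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(sign_test_p_value(wins_a=18, wins_b=6))
```

A small p-value suggests that the preference for one engine is unlikely to be due to chance alone.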

Page 47: Information Retrieval Models

Clickthrough for IR Performance

Experimental design:

Interface that sends the query to two systems:

• Google and MSNsearch
• Google and Default (50 links from MSNsearch in reverse order)
• MSNsearch and Default

Page 48: Information Retrieval Models

Clickthrough for IR Performance

Does the clickthrough evaluation agree with the relevance judgements?

Their conclusion in the small experiment is that there is a strong correlation between relevance judgements and clickthrough data.

Can this be a generalized conclusion?