IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query...

IR Models

J. H. Wang, Mar. 11, 2008

The Retrieval Process

[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index (inverted file), DB Manager Module, Text Database. Labeled flows: user need, text, logical view, query, user feedback, retrieved docs, ranked docs.]

Introduction

• Traditional information retrieval systems usually adopt index terms to index and retrieve documents
  – An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun)

• Advantages
  – Simple
  – The semantics of the documents and of the user information need can be naturally expressed through sets of index terms

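[Figure: Docs and the user's Information Need are each represented through Index Terms (doc, query); the two representations are matched and the results are ranked.]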

IR Models

Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).

A Taxonomy of Information Retrieval Models

[Figure: taxonomy of IR models, organized by user task.]

• User task: Retrieval (Ad hoc, Filtering)
  – Classic Models: Boolean, Vector, Probabilistic
  – Set Theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  – Probabilistic: Inference Network, Belief Network
  – Structured Models: Non-overlapping Lists, Proximal Nodes
• User task: Browsing
  – Flat, Structure Guided, Hypertext

User task  | Index Terms                                      | Full Text                                        | Full Text + Structure
Retrieval  | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing   | Flat                                             | Flat, Hypertext                                  | Structure Guided, Hypertext

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.

Retrieval : Ad hoc and Filtering

• Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system

• Routing (Filtering): The queries remain relatively static while new documents come into the system

Retrieval: Ad Hoc x Filtering

• Ad hoc retrieval:

[Figure: queries Q1 to Q5 are submitted against a "fixed size" collection.]

• Filtering:

[Figure: a stream of documents is compared against the User 1 profile and the User 2 profile, producing the docs filtered for User 1 and the docs filtered for User 2.]

A Formal Characterization of IR Models

• D : A set composed of logical views (or representation) for the documents in the collection

• Q : A set composed of logical views (or representation) for the user information needs (queries)

• F : A framework for modeling document representations, queries, and their relationships

• R(qi, dj) : A ranking function which defines an ordering among the documents with regard to the query

Definition

• ki : A generic index term

• K : The set of all index terms {k1,…,kt}

• wi,j : A weight associated with index term ki of a document dj

• gi : A function that returns the weight associated with ki in any t-dimensional vector ( gi(dj)=wi,j )

Classic IR Model

• Basic concepts: Each document is described by a set of representative keywords called index terms

• Assign numerical weights to index terms, since distinct index terms have varying relevance when describing document contents

• Three classic models: Boolean, vector, probabilistic

Boolean Model

• Binary decision criterion
  – Either relevant or nonrelevant (no partial match)

• Data retrieval model

• Advantage
  – Clean formalism, simplicity

• Disadvantage
  – It is not simple to translate an information need into a Boolean expression
  – Exact matching may lead to retrieval of too few or too many documents

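Example

• The query $q = k_a \land (k_b \lor \lnot k_c)$ can be represented as a disjunction of conjunctive vectors (in DNF)
  – $\vec{q}_{dnf} = (1,1,1) \lor (1,1,0) \lor (1,0,0)$, with components ordered as (ka, kb, kc)

• Formal definition
  – For the Boolean model, the index term weights are all binary, i.e. $w_{i,j} \in \{0,1\}$
  – A query is a conventional Boolean expression, which can be transformed to a disjunctive normal form (qcc: conjunctive component)

$$
sim(d_j, q) =
\begin{cases}
1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \land (\forall k_i,\ w_{i,j} = g_i(\vec{q}_{cc})) \\
0 & \text{otherwise}
\end{cases}
$$

[Figure: Venn diagram of Ka, Kb, Kc with the regions (1,1,1), (1,1,0), and (1,0,0) marked.]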
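The matching rule above can be sketched in a few lines of Python. This is only an illustration, assuming documents and conjunctive components are encoded as binary tuples over (ka, kb, kc); the name `boolean_sim` and the three sample documents are hypothetical and do not come from the slides.

```python
# Query q = ka AND (kb OR NOT kc), written in disjunctive normal form.
# Each tuple is one conjunctive component over the terms (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}


def boolean_sim(doc_weights, q_dnf):
    """Return 1 if the binary weight vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc_weights) in q_dnf else 0


# Hypothetical documents, given as binary weights for (ka, kb, kc).
docs = {"d1": (1, 1, 0), "d2": (0, 1, 1), "d3": (1, 0, 1)}
results = {name: boolean_sim(weights, Q_DNF) for name, weights in docs.items()}
print(results)
```

Note that d3 contains ka but is still rejected, because it has kc without kb: the model has no notion of a partial match.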

Vector Model [Salton, 1968]

• Assign non-binary weights to index terms in queries and in documents => TFxIDF

• Compute the similarity between documents and query => Sim(Dj, Q)

• More precise than Boolean model

The IR Problem: A Clustering Problem

• We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects

• Intra-cluster similarity
  – What are the features which better describe the objects in the set A?

• Inter-cluster dissimilarity
  – What are the features which better distinguish the objects in the set A from the remaining objects in the collection C?

Idea for TFxIDF

• TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj
  – term frequency (the tf factor) provides one measure of how well that term describes the document contents

• IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection
  – inverse document frequency (the idf factor)

Vector Model (1/4)

• Index terms are assigned positive and non-binary weights

• The index terms in the query are also weighted

• Term weights are used to compute the degree of similarity between documents and the user query

• Then, retrieved documents are sorted in decreasing order of this degree of similarity

$$ \vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) $$

$$ \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q}) $$

Vector Model (2/4)

• Degree of similarity

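$$
sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|}
= \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}
$$

Figure 2.4 The cosine of θ (the angle between the vectors dj and q) is adopted as sim(dj,q)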
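A minimal Python sketch of the cosine formula, assuming each document and the query are already given as equal-length weight vectors. The function name `cosine_sim` and the three sample vectors are illustrative choices, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math


def cosine_sim(d, q):
    """Cosine of the angle between weight vectors d and q (0.0 if either is all zeros)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


# Hypothetical weight vectors over three index terms.
docs = {"d1": (2, 0, 1), "d2": (0, 5, 0), "d3": (1, 2, 4)}
q = (1, 2, 3)

# Sort document names by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_sim(docs[name], q), reverse=True)
print(ranking)
```

Because of the normalization by vector length, d2 ranks last here even though its raw inner product with q (10) is larger than that of d1 (5).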

Vector Model (3/4)

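• Definition
  – normalized frequency, where $freq_{i,j}$ is the raw frequency of term ki in document dj and the maximum is taken over all terms $k_l$ in dj

$$ f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} $$

  – inverse document frequency, where N is the total number of documents and $n_i$ is the number of documents in which ki appears

$$ idf_i = \log \frac{N}{n_i} $$

  – term-weighting schemes

$$ w_{i,j} = f_{i,j} \times idf_i $$

  – query-term weights

$$ w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i} $$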
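The weighting scheme can be sketched as follows. This is an illustration under stated assumptions: documents are non-empty token lists, the logarithm is the natural log (the slides do not fix a base), and the names `tf_idf_weights`, `query_weights`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """n_i: the number of documents in which each term appears."""
    return Counter(term for tokens in docs.values() for term in set(tokens))


def tf_idf_weights(docs):
    """docs maps a name to a non-empty token list; returns name -> {term: w_ij}."""
    n_docs = len(docs)
    doc_freq = document_frequencies(docs)
    weights = {}
    for name, tokens in docs.items():
        freq = Counter(tokens)
        max_freq = max(freq.values())
        weights[name] = {
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        }
    return weights


def query_weights(query_tokens, docs):
    """Query-term weights (0.5 + 0.5 * freq / max freq) * log(N / n_i)."""
    n_docs = len(docs)
    doc_freq = document_frequencies(docs)
    freq = Counter(query_tokens)
    max_freq = max(freq.values())
    return {
        term: (0.5 + 0.5 * count / max_freq) * math.log(n_docs / doc_freq[term])
        for term, count in freq.items()
        if doc_freq[term] > 0  # terms absent from the collection get no weight
    }


# Toy corpus (hypothetical).
docs = {
    "d1": ["ir", "model", "model"],
    "d2": ["ir", "vector"],
    "d3": ["ir", "boolean", "model"],
}
w = tf_idf_weights(docs)
qw = query_weights(["vector", "model"], docs)
```

In this corpus the term "ir" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to any document weight, which is exactly the discriminating behavior the idf factor is meant to provide.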

Vector Model (4/4)

• Advantages
  – Its term-weighting scheme improves retrieval performance
  – Its partial matching strategy allows retrieval of documents that approximate the query conditions
  – Its cosine ranking formula sorts the documents according to their degree of similarity to the query

• Disadvantage
  – The assumption of mutual independence between index terms

The Vector Model: Example I

      k1  k2  k3  q · dj
d1     1   0   1    2
d2     1   0   0    1
d3     0   1   1    2
d4     1   0   0    1
d5     1   1   1    3
d6     1   1   0    2
d7     0   1   0    1

q      1   1   1

[Figure: Venn diagram placing d1 to d7 among the term sets k1, k2, k3.]

The Vector Model: Example II

      k1  k2  k3  q · dj
d1     1   0   1    4
d2     1   0   0    1
d3     0   1   1    5
d4     1   0   0    1
d5     1   1   1    6
d6     1   1   0    3
d7     0   1   0    2

q      1   2   3

[Figure: Venn diagram placing d1 to d7 among the term sets k1, k2, k3.]

The Vector Model: Example III

      k1  k2  k3  q · dj
d1     2   0   1    5
d2     1   0   0    1
d3     0   1   3   11
d4     2   0   0    2
d5     1   2   4   17
d6     1   2   0    5
d7     0   5   0   10

q      1   2   3

[Figure: Venn diagram placing d1 to d7 among the term sets k1, k2, k3.]
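The last column of Example III can be checked with a short Python sketch. The weights are copied from the example; the helper name `dot` is illustrative. Note that the column is the plain inner product q · dj, without the length normalization of the cosine formula.

```python
# Document weights over (k1, k2, k3) and the query, as given in Example III.
docs = {
    "d1": (2, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 3),
    "d4": (2, 0, 0),
    "d5": (1, 2, 4),
    "d6": (1, 2, 0),
    "d7": (0, 5, 0),
}
q = (1, 2, 3)


def dot(d, q):
    """Unnormalized inner product of two weight vectors."""
    return sum(wd * wq for wd, wq in zip(d, q))


scores = {name: dot(weights, q) for name, weights in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```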

Probabilistic Model (1/6)

• Introduced by Robertson and Sparck Jones, 1976
  – Binary independence retrieval (BIR) model

• Idea: Given a user query q, and the ideal answer set R of the relevant documents, the problem is to specify the properties for this set
  – Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance
  – The probabilistic model tries to estimate the probability that the user will find the document dj relevant, using the ratio P(dj relevant to q)/P(dj nonrelevant to q)


• Definition
  – All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$
  – Let R be the set of documents known to be relevant to query q
  – Let $\bar{R}$ be the complement of R
  – Let $P(R \mid \vec{d}_j)$ be the probability that the document dj is relevant to the query q
  – Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that the document dj is nonrelevant to the query q


• The similarity sim(dj,q) of the document dj to the query q is defined as the ratio

$$ sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)} $$

• Using Bayes' rule,

$$ sim(d_j, q) = \frac{P(\vec{d}_j \mid R) \times P(R)}{P(\vec{d}_j \mid \bar{R}) \times P(\bar{R})} $$

  – $P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant
  – $P(\vec{d}_j \mid R)$ stands for the probability of randomly selecting the document dj from the set R of relevant documents


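• Assuming independence of index terms and given $\vec{d}_j = (d_1, d_2, \ldots, d_t)$, where each $d_i \in \{0,1\}$ indicates whether ki occurs in dj,

$$ P(\vec{d}_j \mid R) = \prod_{i=1}^{t} P(k_i = d_i \mid R), \qquad P(\vec{d}_j \mid \bar{R}) = \prod_{i=1}^{t} P(k_i = d_i \mid \bar{R}) $$

• Taking logarithms, which is a monotonic transformation and so leaves the ranking unchanged,

$$ sim(d_j, q) = \log \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})} + \log \frac{P(R)}{P(\bar{R})} $$

• The second term is the same for every document, so it can be dropped for ranking purposes:

$$ sim(d_j, q) \sim \log \frac{\prod_{i=1}^{t} P(k_i = d_i \mid R)}{\prod_{i=1}^{t} P(k_i = d_i \mid \bar{R})} $$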


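  – $P(k_i \mid R)$ stands for the probability that the index term ki is present in a document randomly selected from the set R
  – $P(\bar{k}_i \mid R)$ stands for the probability that the index term ki is not present in a document randomly selected from the set R

$$
sim(d_j, q) \sim
\frac{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R) \right)}
     {\left( \prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R}) \right)}
$$

Since $P(k_i \mid R) + P(\bar{k}_i \mid R) = 1$, in logarithmic form and ignoring factors that are constant for all documents in the context of the same query,

$$
sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times
\left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
$$

Because the weights are binary, only the index terms that occur in both the query and the document contribute to the sum:

$$
sim(d_j, q) \sim \sum_{k_i \in q \cap d_j}
\left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
$$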

Estimation of Term Relevance

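In the very beginning (N is the total number of documents and $df_i$ is the number of documents containing ki):

$$ P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{df_i}{N} $$

Let V be a subset of the documents initially retrieved, and let $V_i$ be the subset of V composed of the documents that contain ki (V and $V_i$ also denote the sizes of these sets).

Next, the ranking can be improved as follows:

$$ P(k_i \mid R) = \frac{V_i}{V}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i}{N - V} $$

For small values of V and Vi, an adjustment factor is added:

$$ P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1} $$

Alternatively, the constant 0.5 can be replaced by the fraction $df_i / N$:

$$ P(k_i \mid R) = \frac{V_i + \frac{df_i}{N}}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + \frac{df_i}{N}}{N - V + 1} $$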
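A rough Python sketch of how these estimates feed the ranking formula of the probabilistic model. It assumes binary term weights and uses the 0.5-smoothed re-estimates; the function names, the vocabulary, and the counts (N = 100, df values) are hypothetical. The sketch also assumes every estimate lies strictly between 0 and 1, since the logarithms are undefined otherwise.

```python
import math


def initial_estimates(df_i, n_docs):
    """Initial guesses: P(k_i|R) = 0.5 and P(k_i|not R) = df_i / N."""
    return 0.5, df_i / n_docs


def feedback_estimates(v_i, v, df_i, n_docs):
    """Smoothed re-estimates from V retrieved documents, V_i of which contain k_i."""
    p_rel = (v_i + 0.5) / (v + 1)
    p_nonrel = (df_i - v_i + 0.5) / (n_docs - v + 1)
    return p_rel, p_nonrel


def term_score(p_rel, p_nonrel):
    """Contribution of one term that occurs in both the query and the document."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


def bir_sim(doc_terms, query_terms, estimates):
    """Sum the term scores over the terms shared by the query and the document."""
    return sum(term_score(*estimates[term]) for term in query_terms & doc_terms)


# Hypothetical collection statistics.
N = 100
df = {"retrieval": 10, "model": 50}
estimates = {term: initial_estimates(count, N) for term, count in df.items()}

query = {"retrieval", "model"}
score_a = bir_sim({"retrieval", "model"}, query, estimates)
score_b = bir_sim({"model"}, query, estimates)
```

With the initial estimates, the term "model" (present in half of the collection) contributes log 1 + log 1 = 0, so only the rarer term "retrieval" separates the two documents; the first-pass ranking is driven by document frequency alone.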

• Advantage
  – Documents are ranked in decreasing order of their probability of being relevant

• Disadvantage
  – The need to guess the initial relevant and nonrelevant sets
  – Term frequency is not considered
  – Independence assumption for index terms

Brief Comparison of Classic Models

• Boolean model is the weakest
  – Not able to recognize partial matches

• Controversy between probabilistic and vector models
  – The vector model is expected to outperform the probabilistic model with general collections

Alternative Set Theoretic Models

• Fuzzy Set Model

• Extended Boolean Model

Fuzzy Theory

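• A fuzzy subset A of a universe U is characterized by a membership function $\mu_A : U \to [0,1]$ which associates with each element $u \in U$ a number $\mu_A(u)$

• Let A and B be two fuzzy subsets of U,

$$ \mu_{\bar{A}}(u) = 1 - \mu_A(u) $$

$$ \mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u)) $$

$$ \mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u)) $$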

Fuzzy Information Retrieval

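• Using a term-term correlation matrix, where $df_u$ is the number of documents containing ku and $df_{u,v}$ is the number of documents containing both ku and kv

$$ c_{u,v} = \frac{df_{u,v}}{df_u + df_v - df_{u,v}} $$

• Define a fuzzy set associated to each index term ki

$$ \mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l}) $$

  – If dj contains a term kl that is strongly related to ki, that is $c_{i,l} \sim 1$, then $\mu_i(d_j) \sim 1$
  – If the terms kl in dj are only loosely related to ki, that is $c_{i,l} \sim 0$, then $\mu_i(d_j) \sim 0$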
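The two definitions above can be sketched in Python as follows. The function names and the sample document frequencies are hypothetical; treating a term with no recorded correlation as having correlation 0 is an assumption made for this illustration.

```python
def correlation(df_u, df_v, df_uv):
    """c_{u,v} = df_{u,v} / (df_u + df_v - df_{u,v})."""
    return df_uv / (df_u + df_v - df_uv)


def membership(doc_terms, correlations_with_ki):
    """mu_i(d_j) = 1 - product over k_l in d_j of (1 - c_{i,l}).

    correlations_with_ki maps each term k_l to c_{i,l}; terms missing from
    the mapping are treated as uncorrelated (c = 0), an assumption of this sketch.
    """
    product = 1.0
    for term in doc_terms:
        product *= 1.0 - correlations_with_ki.get(term, 0.0)
    return 1.0 - product


# Two terms that always co-occur are perfectly correlated.
c_same = correlation(10, 10, 10)
# Terms in 10 and 20 documents, 5 of them shared.
c_partial = correlation(10, 20, 5)

# A document with two terms, each half-correlated with k_i.
mu = membership({"a", "b"}, {"a": 0.5, "b": 0.5})
```

A single strongly correlated term is enough to push the membership close to 1, because the product then contains a factor close to 0.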

Example

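• Disjunctive Normal Form

$$ q = k_a \land (k_b \lor \lnot k_c) $$

$$ \vec{q}_{dnf} = (k_a \land k_b \land k_c) \lor (k_a \land k_b \land \lnot k_c) \lor (k_a \land \lnot k_b \land \lnot k_c) = cc_1 \lor cc_2 \lor cc_3 $$

• Membership of dj in each conjunctive component, writing $\mu_{a,j}$ for $\mu_a(d_j)$

$$ \mu_{cc_1,j} = \mu_{a,j} \, \mu_{b,j} \, \mu_{c,j} $$

$$ \mu_{cc_2,j} = \mu_{a,j} \, \mu_{b,j} \, (1 - \mu_{c,j}) $$

$$ \mu_{cc_3,j} = \mu_{a,j} \, (1 - \mu_{b,j}) \, (1 - \mu_{c,j}) $$

• Membership of dj in the query fuzzy set

$$
\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\, j} = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i,j})
= 1 - (1 - \mu_{a,j}\mu_{b,j}\mu_{c,j}) \times (1 - \mu_{a,j}\mu_{b,j}(1 - \mu_{c,j})) \times (1 - \mu_{a,j}(1 - \mu_{b,j})(1 - \mu_{c,j}))
$$

[Figure: Venn diagram of Ka, Kb, Kc with the regions cc1, cc2, and cc3 marked.]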
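The membership computation for this example query can be sketched in Python. The function name `fuzzy_query_membership` is illustrative; its three arguments are assumed to be the memberships of one document in the fuzzy sets of ka, kb, and kc.

```python
def fuzzy_query_membership(mu_a, mu_b, mu_c):
    """Membership of a document in q = ka AND (kb OR NOT kc).

    Conjunctions use the algebraic product and the disjunction of the three
    conjunctive components uses the algebraic sum.
    """
    cc1 = mu_a * mu_b * mu_c
    cc2 = mu_a * mu_b * (1 - mu_c)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)


print(fuzzy_query_membership(1, 1, 1))        # crisp match on (1,1,1)
print(fuzzy_query_membership(1, 0, 1))        # crisp non-match
print(fuzzy_query_membership(0.5, 0.5, 0.5))  # partial membership
```

With crisp memberships (all 0 or 1) the result agrees with the Boolean model; with intermediate values it yields a graded score that can be used for ranking.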

Algebraic Sum and Product

• The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of max function

• The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of min function

• The resulting membership values vary more smoothly than with the max and min functions

Alternative Algebraic Models

• Generalized Vector Space Model

• Latent Semantic Model

• Neural Network Model

Sparse Matrix Problem

• Considering a term-doc matrix of dimensions 1M*1M
  – Most of the entries will be 0 (a sparse matrix)
  – A waste of storage and computation
  – How to reduce the dimensions?

Latent Semantic Indexing (1/5)

• Let M=(Mij) be a term-document association matrix with t rows and N columns

• Latent semantic indexing decomposes M using singular value decomposition

$$ M = K S D^t $$

  – K is the matrix of eigenvectors derived from the term-to-term correlation matrix ($M M^t$)
  – $D^t$ is the matrix of eigenvectors derived from the transpose of the document-to-document matrix ($M^t M$)
  – S is an r × r diagonal matrix of singular values, where r ≤ min(t,N) is the rank of M


• Consider now only the s largest singular values of S, and their corresponding columns in K and Dt

– (The remaining singular values of S are deleted)

• The resultant matrix Ms (rank s) is closest to the original matrix M in the least-squares sense

• s<r is the dimensionality of a reduced concept space

$$ M_s = K_s S_s D_s^t $$
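As a rough illustration of the truncation, the sketch below builds the rank-1 approximation (s = 1) of the term-document matrix from the vector model Example III. It is a simplification under stated assumptions: instead of a full SVD it uses plain power iteration on M^t M to approximate only the largest singular value and its vectors, and the helper names and iteration count are illustrative choices. A numerical library's SVD routine would normally be used for larger s.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def top_singular_triplet(m, iterations=200):
    """Approximate the largest singular value of m and its singular vectors."""
    mt = transpose(m)
    v = [1.0] * len(m[0])
    for _ in range(iterations):
        v = matvec(mt, matvec(m, v))  # one power-iteration step on M^t M
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = matvec(m, v)
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v


# Term-document matrix: rows k1..k3, columns d1..d7 (weights from Example III).
M = [
    [2, 1, 0, 2, 1, 1, 0],
    [0, 0, 1, 0, 2, 2, 5],
    [1, 0, 3, 0, 4, 0, 0],
]
u, sigma, v = top_singular_triplet(M)

# Rank-1 approximation M_1 = sigma * u * v^t.
M1 = [[sigma * ui * vj for vj in v] for ui in u]

# Squared Frobenius norms of M and of the approximation error.
frob2 = sum(x * x for row in M for x in row)
err2 = sum((a - b) ** 2 for ra, rb in zip(M, M1) for a, b in zip(ra, rb))
```

The squared error of the rank-1 approximation equals the squared Frobenius norm of M minus sigma squared, which is the sense in which keeping the largest singular values discards the least information.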


• The selection of s attempts to balance two opposing effects
  – s should be large enough to allow fitting all the structure in the real data
  – s should be small enough to allow filtering out all the non-relevant representational details


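• Consider the relationship between any two documents

$$
\begin{aligned}
M_s^t M_s &= (K_s S_s D_s^t)^t \, (K_s S_s D_s^t) \\
          &= D_s S_s K_s^t K_s S_s D_s^t \\
          &= D_s S_s S_s D_s^t \\
          &= (D_s S_s)(D_s S_s)^t
\end{aligned}
$$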


• To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix M
  – Assume the query is modeled as the document with number k
  – Then the kth row in the matrix $M_s^t M_s$ provides the ranks of all documents with respect to this query

Computing an Example

• Let (Mij) be given by the matrix

      k1  k2  k3  q · dj
d1     2   0   1    5
d2     1   0   0    1
d3     0   1   3   11
d4     2   0   0    2
d5     1   2   4   17
d6     1   2   0    5
d7     0   5   0   10

q      1   2   3

  – Compute the matrices (K), (S), and (D)t

• Latent Semantic Indexing transforms the occurrence matrix into a relation between the terms and concepts, and a relation between the concepts and the documents
  – Indirect relation between terms and documents through some hidden (or latent) concepts

[Figure: the terms "Taipei" and "Taiwan" are linked to a doc, first directly (marked "?") and then indirectly through (latent) concepts.]

Alternative Probabilistic Model

• Bayesian Networks

• Inference Network Model

• Belief Network Model

Bayesian Network

• Let xi be a node in a Bayesian network G and $\Gamma_{x_i}$ be the set of parent nodes of xi

• The influence of $\Gamma_{x_i}$ on xi can be specified by any set of functions $F_i(x_i, \Gamma_{x_i})$ that satisfy:

$$ \sum_{\forall x_i} F_i(x_i, \Gamma_{x_i}) = 1, \qquad 0 \le F_i(x_i, \Gamma_{x_i}) \le 1 $$

• P(x1,x2,x3,x4,x5)=P(x1)P(x2|x1)P(x3|x1)P(x4|x2,x3)P(x5|x3)
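The factorization of the joint probability can be illustrated with a short Python sketch. It assumes all five variables are binary, and every number in the conditional probability tables is made up for the illustration; only the dependency structure (x1 is the parent of x2 and x3, x2 and x3 are the parents of x4, x3 is the parent of x5) comes from the formula above.

```python
from itertools import product

# Hypothetical tables: each entry is the probability that the node equals 1
# given the values of its parents.
P_X1 = 0.6
P_X2_GIVEN_X1 = {0: 0.2, 1: 0.7}
P_X3_GIVEN_X1 = {0: 0.5, 1: 0.1}
P_X4_GIVEN_X2_X3 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}
P_X5_GIVEN_X3 = {0: 0.3, 1: 0.8}


def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(x1, x2, x3, x4, x5):
    """P(x1,...,x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (
        bernoulli(P_X1, x1)
        * bernoulli(P_X2_GIVEN_X1[x1], x2)
        * bernoulli(P_X3_GIVEN_X1[x1], x3)
        * bernoulli(P_X4_GIVEN_X2_X3[(x2, x3)], x4)
        * bernoulli(P_X5_GIVEN_X3[x3], x5)
    )


# Summing the joint over all 32 assignments must give 1.
total = sum(joint(*values) for values in product((0, 1), repeat=5))
```

Each table plays the role of one function F_i: its values lie in [0, 1] and, for fixed parent values, the probabilities of the two outcomes sum to 1, which is why the joint sums to 1 over all assignments.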

Belief Network Model (1/6)

• The probability space
  – The set K={k1, k2, …, kt} of all index terms is the universe. To each subset $u \subseteq K$ is associated a vector $\vec{k}$ such that $g_i(\vec{k}) = 1 \iff k_i \in u$

• Random variables
  – To each index term ki is associated a binary random variable


• Concept space
  – A document dj is represented as a concept composed of the terms used to index dj
  – A user query q is also represented as a concept composed of the terms used to index q
  – Both user query and document are modeled as subsets of index terms

• Probability distribution P over K

$$ P(c) = \sum_{u} P(c \mid u) \times P(u), \qquad P(u) = \left( \frac{1}{2} \right)^t $$

  – P(c) is the degree of coverage of K by c


• A query q is modeled as a network node
  – This random variable is set to 1 whenever q completely covers the concept space K
  – P(q) computes the degree of coverage of the space K by q

• A document dj is modeled as a network node
  – This random variable is 1 to indicate that dj completely covers the concept space K
  – P(dj) computes the degree of coverage of the space K by dj


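• Assumption
  – P(dj|q) is adopted as the rank of the document dj with respect to the query q

$$
\begin{aligned}
P(d_j \mid q) &= P(d_j \land q) / P(q) \\
P(d_j \land q) &= \sum_{u} P(d_j \land q \mid u) \times P(u) \\
               &= \sum_{u} P(d_j \mid u) \times P(q \mid u) \times P(u) \\
               &= \sum_{\vec{k}} P(d_j \mid \vec{k}) \times P(q \mid \vec{k}) \times P(\vec{k})
\end{aligned}
$$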


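• Specify the conditional probabilities as follows, where $\vec{k}_i$ denotes the state in which only the index term ki is active

$$
P(q \mid \vec{k}) =
\begin{cases}
\dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k}_i \land g_i(\vec{q}) = 1 \\
0 & \text{otherwise}
\end{cases}
$$

$$
P(d_j \mid \vec{k}) =
\begin{cases}
\dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k}_i \land g_i(\vec{d}_j) = 1 \\
0 & \text{otherwise}
\end{cases}
$$

• Thus, the belief network model can be tuned to subsume the vector model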
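A small Python sketch of why this choice of conditional probabilities reproduces the vector model ranking. It assumes non-zero weight vectors and the uniform prior P(k) = (1/2)^t; the function names and the sample vectors are illustrative. Only the t single-term states contribute to the sum, because both conditionals are zero for every other state.

```python
import math


def normalized(weights):
    """Divide each weight by the Euclidean length of the vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


def belief_rank(d, q):
    """Sum over the single-term states k_i of P(d_j|k_i) * P(q|k_i) * P(k_i)."""
    t = len(d)
    p_state = 0.5 ** t  # uniform prior over the 2^t states
    p_d = normalized(d)
    p_q = normalized(q)
    return sum(pd * pq * p_state for pd, pq in zip(p_d, p_q))


d = (2, 0, 1)
q = (1, 2, 3)
rank = belief_rank(d, q)

# Dividing out the constant prior leaves exactly the cosine of the vector model.
cosine = rank / 0.5 ** len(d)
```

Since the prior is the same constant for every document, ranking by this sum orders the documents exactly as the cosine formula does.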

Comparison

• Belief network model
  – Is based on a set-theoretic view
  – It provides a separation between the document and the query
  – It is able to reproduce any ranking strategy generated by the inference network model

• Inference network model
  – Takes a purely epistemological view which is more difficult to grasp
