IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query...
![Page 1: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/1.jpg)
IR Models
J. H. Wang, Mar. 11, 2008
![Page 2: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/2.jpg)
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Labeled flows: user need, user feedback, query, logical view, inverted file, retrieved docs, ranked docs.]
![Page 3: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/3.jpg)
Introduction
• Traditional information retrieval systems usually adopt index terms to index and retrieve documents
– An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun)
• Advantages
– Simple
– The semantics of the documents and of the user information need can be naturally expressed through sets of index terms
![Page 4: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/4.jpg)
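[Figure: the docs and the user's information need are both expressed through index terms; the resulting doc and query representations are matched and the documents ranked.]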
![Page 5: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/5.jpg)
IR Models
Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
![Page 6: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/6.jpg)
A Taxonomy of Information Retrieval Models
• User task: Retrieval (Ad hoc, Filtering)
– Classic Models: Boolean, Vector, Probabilistic
– Set Theoretic: Fuzzy, Extended Boolean
– Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
– Probabilistic: Inference Network, Belief Network
– Structured Models: Non-overlapping Lists, Proximal Nodes
• User task: Browsing
– Flat, Structure Guided, Hypertext
![Page 7: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/7.jpg)
| User task \ Logical view of documents | Index Terms | Full Text | Full Text + Structure |
|---|---|---|---|
| Retrieval | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured |
| Browsing | Flat | Flat, Hypertext | Structure Guided, Hypertext |

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
![Page 8: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/8.jpg)
Retrieval : Ad hoc and Filtering
• Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system
• Routing (Filtering): The queries remain relatively static while new documents come into the system
![Page 9: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/9.jpg)
Retrieval: Ad Hoc x Filtering
• Ad hoc retrieval:
[Figure: queries Q1 to Q5 are submitted against a collection of "fixed size".]
![Page 10: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/10.jpg)
Retrieval: Ad Hoc x Filtering
• Filtering:
[Figure: a documents stream is matched against the profiles of User 1 and User 2, producing the docs filtered for each user.]
![Page 11: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/11.jpg)
A Formal Characterization of IR Models
• D : A set composed of logical views (or representation) for the documents in the collection
• Q : A set composed of logical views (or representation) for the user information needs (queries)
• F : A framework for modeling document representations, queries, and their relationships
• R(qi, dj) : A ranking function which defines an ordering among the documents with regard to the query
![Page 12: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/12.jpg)
Definition
• $k_i$ : A generic index term
• $K$ : The set of all index terms $\{k_1, \dots, k_t\}$
• $w_{i,j}$ : A weight associated with index term $k_i$ of a document $d_j$
• $g_i$ : A function that returns the weight associated with $k_i$ in any t-dimensional vector ( $g_i(d_j) = w_{i,j}$ )
![Page 13: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/13.jpg)
Classic IR Model
• Basic concepts: Each document is described by a set of representative keywords called index terms
• Assign numerical weights to index terms to capture how relevant each term is for describing a document's contents
• Three classic models: Boolean, vector, probabilistic
![Page 14: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/14.jpg)
Boolean Model
• Binary decision criterion
– Either relevant or nonrelevant (no partial match)
• Data retrieval model
• Advantage
– Clean formalism, simplicity
• Disadvantage
– It is not simple to translate an information need into a Boolean expression
– Exact matching may lead to retrieval of too few or too many documents
![Page 15: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/15.jpg)
Example
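• A query can be represented as a disjunction of conjunctive vectors (in DNF)
– $Q = q_a \wedge (q_b \vee \neg q_c) = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
• Formal definition
– For the Boolean model, the index term weights are all binary, i.e. $w_{i,j} \in \{0,1\}$
– A query is a conventional Boolean expression, which can be transformed to a disjunctive normal form $q_{dnf}$ ($q_{cc}$: conjunctive component)

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} \mid (q_{cc} \in q_{dnf}) \wedge (\forall k_i,\ w_{i,j} = g_i(q_{cc})) \\ 0 & \text{otherwise} \end{cases}$$

[Figure: Venn diagram of the term sets $K_a$, $K_b$, $K_c$ with the regions (1,1,1), (1,1,0) and (1,0,0) of $q_{dnf}$ marked.]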
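The Boolean matching rule can be illustrated with a minimal Python sketch. The function names `to_dnf` and `boolean_sim` are hypothetical, and the query is assumed to be the slide's example $q_a \wedge (q_b \vee \neg q_c)$ over the three terms $(k_a, k_b, k_c)$.

```python
from itertools import product

def to_dnf(query, num_terms):
    """Enumerate the binary vectors (conjunctive components) that satisfy `query`."""
    return {bits for bits in product((0, 1), repeat=num_terms) if query(*bits)}

def boolean_sim(doc_vector, q_dnf):
    """Return 1 if the document's binary weight vector equals some
    conjunctive component of the query, and 0 otherwise."""
    return 1 if tuple(doc_vector) in q_dnf else 0

# Q = qa AND (qb OR NOT qc), evaluated over every (ka, kb, kc) assignment.
q_dnf = to_dnf(lambda a, b, c: bool(a and (b or not c)), 3)

if __name__ == "__main__":
    print(sorted(q_dnf))
    print(boolean_sim((1, 1, 0), q_dnf), boolean_sim((0, 1, 1), q_dnf))
```

Enumerating all $2^t$ assignments is only practical for a handful of terms; it is used here because it mirrors the definition directly.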
![Page 16: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/16.jpg)
Vector Model [Salton, 1968]
• Assign non-binary weights to index terms in queries and in documents => TFxIDF
• Compute the similarity between documents and query => Sim(Dj, Q)
• More precise than Boolean model
![Page 17: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/17.jpg)
The IR Problem: A Clustering Problem
• We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects
• Intra-cluster similarity
– What are the features which better describe the objects in the set A?
• Inter-cluster similarity
– What are the features which better distinguish the objects in the set A from the remaining objects in the collection C?
![Page 18: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/18.jpg)
Idea for TFxIDF
• TF: intra-cluster similarity is quantified by measuring the raw frequency of a term $k_i$ inside a document $d_j$
– Term frequency (the tf factor) provides one measure of how well that term describes the document contents
• IDF: inter-cluster similarity is quantified by measuring the inverse of the frequency of a term $k_i$ among the documents in the collection
– Inverse document frequency (the idf factor)
![Page 19: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/19.jpg)
Vector Model (1/4)
• Index terms are assigned positive and non-binary weights
• The index terms in the query are also weighted
• Term weights are used to compute the degree of similarity between documents and the user query
• Then, retrieved documents are sorted in decreasing order of similarity

$$\vec{d_j} = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$$

$$\vec{q} = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$$
![Page 20: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/20.jpg)
Vector Model (2/4)
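• Degree of similarity

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Figure 2.4 The cosine of $\theta$, the angle between $\vec{d_j}$ and $\vec{q}$, is adopted as $sim(d_j, q)$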
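The cosine measure can be sketched in a few lines of Python. The function name `cosine_sim` is hypothetical, and returning 0.0 for an all-zero vector is an assumption of this sketch (the formula itself is undefined in that case).

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between a document weight vector and a query weight vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_d * norm_q)

if __name__ == "__main__":
    print(cosine_sim((1, 2, 4), (1, 2, 3)))
```

Because the query norm is the same for every document, it can be dropped when only the ordering of documents matters.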
![Page 21: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/21.jpg)
Vector Model (3/4)
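• Definition ($freq_{i,j}$: raw frequency of $k_i$ in $d_j$; $N$: total number of documents; $n_i$: number of documents containing $k_i$)
– Normalized frequency: $f_{i,j} = \dfrac{freq_{i,j}}{\max_l freq_{l,j}}$
– Inverse document frequency: $idf_i = \log \dfrac{N}{n_i}$
– Term-weighting scheme: $w_{i,j} = freq_{i,j} \times idf_i$ (the normalized $f_{i,j}$ is commonly used in place of the raw $freq_{i,j}$)
– Query-term weights: $w_{i,q} = \left(0.5 + \dfrac{0.5 \times freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \dfrac{N}{n_i}$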
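As a rough illustration of these definitions, the sketch below computes document and query weights for a toy collection of tokenized documents. The function names are hypothetical, the document weights use the normalized frequency $f_{i,j}$ (an assumption of this sketch), and query terms that occur in no document are simply skipped.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """n_i: number of documents that contain each term."""
    return Counter(term for doc in docs for term in set(doc))

def tf_idf_weights(docs):
    """Return one {term: w_ij} dict per document, with w_ij = f_ij * log(N / n_i)."""
    n_docs = len(docs)
    doc_freq = document_frequencies(docs)
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values(), default=1)
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights

def query_weights(query, docs):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    n_docs = len(docs)
    doc_freq = document_frequencies(docs)
    freq = Counter(query)
    max_freq = max(freq.values(), default=1)
    return {
        term: (0.5 + 0.5 * count / max_freq) * math.log(n_docs / doc_freq[term])
        for term, count in freq.items()
        if doc_freq[term] > 0  # assumption: unseen terms get no weight
    }

if __name__ == "__main__":
    toy_docs = [["a", "a", "b"], ["a", "c"]]
    print(tf_idf_weights(toy_docs))
    print(query_weights(["b", "z"], toy_docs))
```

A term that appears in every document gets weight 0, which reflects the idf intuition that such a term does not help distinguish documents.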
![Page 22: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/22.jpg)
Vector Model (4/4)
• Advantages
– Its term-weighting scheme improves retrieval performance
– Its partial matching strategy allows retrieval of documents that approximate the query conditions
– Its cosine ranking formula sorts the documents according to their degree of similarity to the query
• Disadvantage
– The assumption of mutual independence between index terms
![Page 23: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/23.jpg)
The Vector Model: Example I
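| | k1 | k2 | k3 | q • dj |
|---|---|---|---|---|
| d1 | 1 | 0 | 1 | 2 |
| d2 | 1 | 0 | 0 | 1 |
| d3 | 0 | 1 | 1 | 2 |
| d4 | 1 | 0 | 0 | 1 |
| d5 | 1 | 1 | 1 | 3 |
| d6 | 1 | 1 | 0 | 2 |
| d7 | 0 | 1 | 0 | 1 |
| q | 1 | 1 | 1 | |

[Figure: Venn diagram placing documents d1 to d7 in the regions of the term sets k1, k2, k3.]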
![Page 24: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/24.jpg)
The Vector Model: Example II
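| | k1 | k2 | k3 | q • dj |
|---|---|---|---|---|
| d1 | 1 | 0 | 1 | 4 |
| d2 | 1 | 0 | 0 | 1 |
| d3 | 0 | 1 | 1 | 5 |
| d4 | 1 | 0 | 0 | 1 |
| d5 | 1 | 1 | 1 | 6 |
| d6 | 1 | 1 | 0 | 3 |
| d7 | 0 | 1 | 0 | 2 |
| q | 1 | 2 | 3 | |

[Figure: Venn diagram placing documents d1 to d7 in the regions of the term sets k1, k2, k3.]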
![Page 25: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/25.jpg)
The Vector Model: Example III
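| | k1 | k2 | k3 | q • dj |
|---|---|---|---|---|
| d1 | 2 | 0 | 1 | 5 |
| d2 | 1 | 0 | 0 | 1 |
| d3 | 0 | 1 | 3 | 11 |
| d4 | 2 | 0 | 0 | 2 |
| d5 | 1 | 2 | 4 | 17 |
| d6 | 1 | 2 | 0 | 5 |
| d7 | 0 | 5 | 0 | 10 |
| q | 1 | 2 | 3 | |

[Figure: Venn diagram placing documents d1 to d7 in the regions of the term sets k1, k2, k3.]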
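The last column of Example III is the inner product of the query with each document. A short Python sketch reproduces it; the variable names are hypothetical, and ranking by the unnormalized inner product (rather than the cosine) is an assumption made only to match the table.

```python
def dot(d, q):
    """Inner product of a document weight vector and a query weight vector."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Weights from Example III, as (k1, k2, k3).
docs = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0),
}
q = (1, 2, 3)

scores = {name: dot(vec, q) for name, vec in docs.items()}
# Documents sorted by decreasing score; ties keep their original order.
ranking = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    print(scores)
    print(ranking)
```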
![Page 26: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/26.jpg)
Probabilistic Model (1/6)
• Introduced by Robertson and Sparck Jones, 1976
– Binary independence retrieval (BIR) model
• Idea: Given a user query q, and the ideal answer set R of the relevant documents, the problem is to specify the properties for this set
– Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance
– The probabilistic model tries to estimate the probability that the user will find the document $d_j$ relevant, using the ratio $P(d_j \text{ relevant to } q) / P(d_j \text{ nonrelevant to } q)$
![Page 27: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/27.jpg)
Probabilistic Model (2/6)
• Definition
– All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$
– Let $R$ be the set of documents known to be relevant to query q
– Let $\bar{R}$ be the complement of $R$
– Let $P(R|d_j)$ be the probability that the document $d_j$ is relevant to the query q
– Let $P(\bar{R}|d_j)$ be the probability that the document $d_j$ is nonrelevant to the query q
![Page 28: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/28.jpg)
Probabilistic Model (3/6)
• The similarity $sim(d_j, q)$ of the document $d_j$ to the query q is defined as the ratio

$$sim(d_j, q) = \frac{P(R|d_j)}{P(\bar{R}|d_j)}$$

• Using Bayes’ rule,

$$sim(d_j, q) = \frac{P(d_j|R) \times P(R)}{P(d_j|\bar{R}) \times P(\bar{R})}$$

– $P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant
– $P(d_j|R)$ stands for the probability of randomly selecting the document $d_j$ from the set R of relevant documents
![Page 29: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/29.jpg)
Probabilistic Model (4/6)
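• Assuming independence of index terms, and given the document vector $d_j = (d_1, d_2, \dots, d_t)$,

$$P(d_j|R) = \prod_{i=1}^{t} P(k_i = d_i | R), \qquad P(d_j|\bar{R}) = \prod_{i=1}^{t} P(k_i = d_i | \bar{R})$$

• Taking logarithms, which preserves the ranking,

$$sim(d_j, q) \sim \log \frac{P(d_j|R)}{P(d_j|\bar{R})} + \log \frac{P(R)}{P(\bar{R})}$$

• $P(R)$ and $P(\bar{R})$ are the same for all documents, so the second term can be dropped:

$$sim(d_j, q) \sim \log \frac{\prod_{i=1}^{t} P(k_i = d_i | R)}{\prod_{i=1}^{t} P(k_i = d_i | \bar{R})}$$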
![Page 30: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/30.jpg)
Probabilistic Model (5/6)
– $P(k_i|R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set R
– $P(\bar{k_i}|R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set R
![Page 31: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/31.jpg)
Probabilistic Model (6/6)
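$$sim(d_j, q) \sim \frac{\prod_{g_i(d_j)=1} P(k_i|R) \times \prod_{g_i(d_j)=0} P(\bar{k_i}|R)}{\prod_{g_i(d_j)=1} P(k_i|\bar{R}) \times \prod_{g_i(d_j)=0} P(\bar{k_i}|\bar{R})}$$

• Since $P(k_i|R) + P(\bar{k_i}|R) = 1$, taking logarithms and ignoring factors that are constant for all documents gives

$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

• Because the weights are binary, only index terms that occur in both the query and the document contribute to the sum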
![Page 32: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/32.jpg)
Estimation of Term Relevance
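• In the very beginning ($df_i$: number of documents containing $k_i$; $N$: total number of documents):

$$P(k_i|R) = 0.5, \qquad P(k_i|\bar{R}) = \frac{df_i}{N}$$

• Let V be a subset of the documents initially retrieved, and $V_i$ the subset of V whose documents contain $k_i$ (V and $V_i$ also denote the sizes of these sets). Next, the ranking can be improved as follows:

$$P(k_i|R) = \frac{V_i}{V}, \qquad P(k_i|\bar{R}) = \frac{df_i - V_i}{N - V}$$

• For small values of V and $V_i$:

$$P(k_i|R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i|\bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1}$$

• A further variant keeps the denominators $V + 1$ and $N - V + 1$ but replaces the constant 0.5 with a data-dependent adjustment factor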
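The estimates and the BIR ranking formula can be combined in a short Python sketch. All function names are hypothetical; the feedback variant shown is the one with the 0.5 adjustment, and the sketch assumes every estimate lies strictly between 0 and 1 (otherwise the logarithm is undefined).

```python
import math

def bir_term_weight(p_rel, p_nonrel):
    """log(p / (1 - p)) + log((1 - u) / u) for one index term,
    where p = P(k_i|R) and u = P(k_i|not R)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def initial_estimates(df_i, n_docs):
    """Estimates used before any documents have been retrieved."""
    return 0.5, df_i / n_docs

def feedback_estimates(v_i, v, df_i, n_docs):
    """Estimates after retrieving V documents, V_i of which contain the term."""
    return (v_i + 0.5) / (v + 1), (df_i - v_i + 0.5) / (n_docs - v + 1)

def bir_sim(doc_terms, query_terms, estimates):
    """Sum the term weights over terms present in both document and query.
    `estimates` maps each term to its (P(k_i|R), P(k_i|not R)) pair."""
    shared = set(doc_terms) & set(query_terms)
    return sum(bir_term_weight(*estimates[term]) for term in shared)

if __name__ == "__main__":
    est = {"b": initial_estimates(df_i=1, n_docs=10)}
    print(bir_sim({"a", "b"}, {"b", "c"}, est))
```

With the initial estimates the first logarithm vanishes, so each shared term contributes $\log((N - df_i)/df_i)$, an idf-like weight.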
![Page 33: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/33.jpg)
• Advantage
– Documents are ranked in decreasing order of their probability of being relevant
• Disadvantage
– The need to guess the initial relevant and nonrelevant sets
– Term frequency is not considered
– Independence assumption for index terms
![Page 34: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/34.jpg)
Brief Comparison of Classic Models
• The Boolean model is the weakest
– Not able to recognize partial matches
• Controversy between probabilistic and vector models
– The vector model is expected to outperform the probabilistic model with general collections
![Page 35: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/35.jpg)
Alternative Set Theoretic Models
• Fuzzy Set Model
• Extended Boolean Model
![Page 36: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/36.jpg)
Fuzzy Theory
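• A fuzzy subset A of a universe U is characterized by a membership function $\mu_A: U \to [0,1]$ which associates with each element $u \in U$ a number $\mu_A(u)$
• Let A and B be two fuzzy subsets of U,

$$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$$

$$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$$

$$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$$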
![Page 37: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/37.jpg)
Fuzzy Information Retrieval
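• Using a term-term correlation matrix ($df_u$: number of documents containing $k_u$; $df_{u,v}$: number of documents containing both $k_u$ and $k_v$)

$$c_{u,v} = \frac{df_{u,v}}{df_u + df_v - df_{u,v}}$$

• Define a fuzzy set associated to each index term $k_i$

$$\mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

– If some term $k_l$ of $d_j$ is strongly related to $k_i$, that is $c_{i,l} \sim 1$, then $\mu_i(d_j) \sim 1$
– If every term $k_l$ of $d_j$ is only loosely related to $k_i$, that is $c_{i,l} \sim 0$, then $\mu_i(d_j) \sim 0$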
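A minimal Python sketch of the term-term correlation and the resulting fuzzy membership follows. The function names are hypothetical, documents are assumed to be given as sets of index terms, and a correlation of 0.0 is assumed when neither term occurs in the collection.

```python
def correlation(docs, u, v):
    """c_uv = df_uv / (df_u + df_v - df_uv) over a list of term sets."""
    df_u = sum(1 for d in docs if u in d)
    df_v = sum(1 for d in docs if v in d)
    df_uv = sum(1 for d in docs if u in d and v in d)
    denom = df_u + df_v - df_uv
    return df_uv / denom if denom else 0.0  # assumption: 0 when both terms are unseen

def membership(docs, term, doc):
    """Degree to which `doc` belongs to the fuzzy set of `term`:
    1 minus the product over the document's terms k_l of (1 - c_il)."""
    prod = 1.0
    for k_l in doc:
        prod *= 1.0 - correlation(docs, term, k_l)
    return 1.0 - prod

if __name__ == "__main__":
    collection = [{"a", "b"}, {"a"}, {"b", "c"}]
    print(membership(collection, "a", {"b", "c"}))
```

A document that contains the term itself always gets membership 1, since a term's correlation with itself is 1.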
![Page 38: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/38.jpg)
Example
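• Disjunctive Normal Form

$$q = k_a \wedge (k_b \vee \neg k_c)$$

$$q_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \neg k_c) \vee (k_a \wedge \neg k_b \wedge \neg k_c) = cc_1 \vee cc_2 \vee cc_3$$

• Membership of $d_j$ in each conjunctive component

$$\mu_{a,b,c}(d_j) = \mu_a(d_j)\,\mu_b(d_j)\,\mu_c(d_j) = \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}$$

$$\mu_{a,b,\bar{c}}(d_j) = \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})$$

$$\mu_{a,\bar{b},\bar{c}}(d_j) = \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j})$$

• Membership of $d_j$ in the query fuzzy set

$$\mu_q(d_j) = \mu_{cc_1 + cc_2 + cc_3}(d_j) = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i}(d_j)) = 1 - (1 - \mu_{a,b,c}(d_j))(1 - \mu_{a,b,\bar{c}}(d_j))(1 - \mu_{a,\bar{b},\bar{c}}(d_j))$$

[Figure: Venn diagram of the term sets $K_a$, $K_b$, $K_c$ with the regions cc1, cc2 and cc3 marked.]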
![Page 39: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/39.jpg)
Algebraic Sum and Product
• The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of max function
• The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of min function
• Smoother than the max and min functions
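The algebraic sum and product can be sketched in Python for the example query $k_a \wedge (k_b \vee \neg k_c)$. The function names are hypothetical, and the three memberships $\mu_{a,j}$, $\mu_{b,j}$, $\mu_{c,j}$ are assumed to be already known for the document.

```python
def conj(*memberships):
    """Algebraic product: membership in a conjunctive fuzzy set."""
    result = 1.0
    for m in memberships:
        result *= m
    return result

def disj(*memberships):
    """Algebraic sum: membership in a disjunctive fuzzy set, 1 - prod(1 - m)."""
    return 1.0 - conj(*(1.0 - m for m in memberships))

def query_membership(mu_a, mu_b, mu_c):
    """Membership of a document in q = ka AND (kb OR NOT kc), via its DNF."""
    cc1 = conj(mu_a, mu_b, mu_c)
    cc2 = conj(mu_a, mu_b, 1.0 - mu_c)
    cc3 = conj(mu_a, 1.0 - mu_b, 1.0 - mu_c)
    return disj(cc1, cc2, cc3)

if __name__ == "__main__":
    print(query_membership(0.5, 0.5, 0.5))
```

With crisp memberships (0 or 1) the result coincides with the Boolean model; intermediate values give a graded score.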
![Page 40: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/40.jpg)
Alternative Algebraic Models
• Generalized Vector Space Model
• Latent Semantic Model
• Neural Network Model
![Page 41: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/41.jpg)
Sparse Matrix Problem
• Consider a term-document matrix of dimensions 1M × 1M
– Most of the entries will be 0, i.e., a sparse matrix
– A waste of storage and computation
– How to reduce the dimensions?
![Page 42: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/42.jpg)
Latent Semantic Indexing (1/5)
• Let $M = (M_{ij})$ be a term-document association matrix with t rows and N columns
• Latent semantic indexing decomposes M using singular value decomposition

$$M = K S D^t$$

– $K$ is the matrix of eigenvectors derived from the term-to-term correlation matrix ($M M^t$)
– $D^t$ is the matrix of eigenvectors derived from the transpose of the document-to-document matrix ($M^t M$)
– $S$ is an $r \times r$ diagonal matrix of singular values, where $r \le \min(t, N)$ is the rank of M
![Page 43: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/43.jpg)
Latent Semantic Indexing (2/5)
• Consider now only the s largest singular values of S, and their corresponding columns in $K$ and $D^t$
– (The remaining singular values of S are deleted)
• The resultant matrix $M_s$ (rank s) is closest to the original matrix M in the least square sense

$$M_s = K_s S_s D_s^t$$

• $s < r$ is the dimensionality of a reduced concept space
![Page 44: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/44.jpg)
Latent Semantic Indexing (3/5)
• The selection of s attempts to balance two opposing effects
– s should be large enough to allow fitting all the structure in the real data
– s should be small enough to allow filtering out all the non-relevant representational details
![Page 45: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/45.jpg)
Latent Semantic Indexing (4/5)
• Consider the relationship between any two documents

$$\begin{aligned} M_s^t M_s &= (K_s S_s D_s^t)^t (K_s S_s D_s^t) \\ &= D_s S_s K_s^t K_s S_s D_s^t \\ &= D_s S_s S_s D_s^t \\ &= (D_s S_s)(D_s S_s)^t \end{aligned}$$
![Page 46: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/46.jpg)
Latent Semantic Indexing (5/5)
• To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix M
– Assume the query is modeled as the document with number k
– Then the kth row in the matrix $M_s^t M_s$ provides the ranks of all documents with respect to this query
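The ranking matrix $M_s^t M_s = D_s S_s^2 D_s^t$ can be approximated without a full SVD routine. The dependency-free Python sketch below swaps in power iteration with deflation on $M^t M$ to obtain the s largest eigenpairs; that substitution, the function names, and the fixed iteration count are all assumptions of this sketch, and it may converge poorly when singular values are nearly equal.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def top_eigenpairs(sym, s, iters=500):
    """Approximate the s largest eigenpairs of a symmetric positive
    semidefinite matrix by power iteration with deflation."""
    n = len(sym)
    a = [row[:] for row in sym]
    pairs = []
    for _ in range(s):
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: eigenvalue estimate (a squared singular value of M).
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return pairs

def doc_similarities(m, s):
    """Return M_s^t M_s for a term-by-document matrix m, keeping s concepts."""
    gram = matmul(transpose(m), m)  # M^t M, documents by documents
    n = len(gram)
    pairs = top_eigenpairs(gram, s)
    return [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(n)]
            for i in range(n)]

if __name__ == "__main__":
    # Terms k1..k3 as rows; documents d1..d7 plus the query as the last column.
    m = [[2, 1, 0, 2, 1, 1, 0, 1],
         [0, 0, 1, 0, 2, 2, 5, 2],
         [1, 0, 3, 0, 4, 0, 0, 3]]
    query_row = doc_similarities(m, 2)[-1]
    print(query_row[:-1])  # scores of d1..d7 with respect to the query
```

The last row is read off because the query was appended as the last pseudo-document; sorting its entries in decreasing order gives the ranking.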
![Page 47: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/47.jpg)
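Computing an Example
• Let $(M_{ij})$ be given by the matrix

| | k1 | k2 | k3 | q • dj |
|---|---|---|---|---|
| d1 | 2 | 0 | 1 | 5 |
| d2 | 1 | 0 | 0 | 1 |
| d3 | 0 | 1 | 3 | 11 |
| d4 | 2 | 0 | 0 | 2 |
| d5 | 1 | 2 | 4 | 17 |
| d6 | 1 | 2 | 0 | 5 |
| d7 | 0 | 5 | 0 | 10 |
| q | 1 | 2 | 3 | |

– Compute the matrices $K$, $S$, and $D^t$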
![Page 48: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/48.jpg)
• Latent Semantic Indexing transforms the occurrence matrix into a relation between the terms and concepts, and a relation between the concepts and the documents
– Indirect relation between terms and documents through some hidden (or latent) concepts

[Figure: the terms "Taipei" and "Taiwan" and a doc, with the relation between them left open (?).]
![Page 49: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/49.jpg)
[Figure: the terms "Taipei" and "Taiwan" are related to the doc through (latent) concepts.]
![Page 50: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/50.jpg)
Alternative Probabilistic Model
• Bayesian Networks
• Inference Network Model
• Belief Network Model
![Page 51: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/51.jpg)
Bayesian Network
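• Let $x_i$ be a node in a Bayesian network G and $\Gamma_{x_i}$ be the set of parent nodes of $x_i$
• The influence of $\Gamma_{x_i}$ on $x_i$ can be specified by any set of functions $F_i(x_i, \Gamma_{x_i})$ that satisfy:

$$\sum_{\forall x_i} F_i(x_i, \Gamma_{x_i}) = 1, \qquad 0 \le F_i(x_i, \Gamma_{x_i}) \le 1$$

• $P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\,P(x_2|x_1)\,P(x_3|x_1)\,P(x_4|x_2, x_3)\,P(x_5|x_3)$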
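The factorization of the five-node example can be checked numerically with a small Python sketch. The conditional probability values below are made up purely for illustration, and the names `bern` and `joint` are hypothetical; only the network structure comes from the slide.

```python
from itertools import product

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(value = 1)."""
    return p_true if value else 1.0 - p_true

# Illustrative (made-up) tables: each entry is P(node = 1 | parent values).
P_X1 = 0.3
P_X2 = {0: 0.1, 1: 0.8}                      # P(x2 = 1 | x1)
P_X3 = {0: 0.5, 1: 0.4}                      # P(x3 = 1 | x1)
P_X4 = {(0, 0): 0.05, (0, 1): 0.6,
        (1, 0): 0.7, (1, 1): 0.9}            # P(x4 = 1 | x2, x3)
P_X5 = {0: 0.2, 1: 0.75}                     # P(x5 = 1 | x3)

def joint(x1, x2, x3, x4, x5):
    """P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (bern(P_X1, x1) * bern(P_X2[x1], x2) * bern(P_X3[x1], x3)
            * bern(P_X4[(x2, x3)], x4) * bern(P_X5[x3], x5))

if __name__ == "__main__":
    total = sum(joint(*xs) for xs in product((0, 1), repeat=5))
    print(joint(1, 1, 1, 1, 1), total)
```

Summing the joint over all 32 assignments should give 1 (up to rounding), which is a quick sanity check that each local table is a valid conditional distribution.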
![Page 52: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/52.jpg)
Belief Network Model (1/6)
• The probability space
– The set $K = \{k_1, k_2, \dots, k_t\}$ of all index terms is the universe. To each subset $u \subseteq K$ is associated a vector $\vec{k}$ such that $g_i(\vec{k}) = 1 \iff k_i \in u$
• Random variables
– To each index term $k_i$ is associated a binary random variable
![Page 53: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/53.jpg)
Belief Network Model (2/6)
• Concept space
– A document $d_j$ is represented as a concept composed of the terms used to index $d_j$
– A user query q is also represented as a concept composed of the terms used to index q
– Both user query and document are modeled as subsets of index terms
• Probability distribution P over K

$$P(c) = \sum_{u} P(c|u) \times P(u), \qquad P(u) = \left(\frac{1}{2}\right)^t$$

– $P(c)$ is the degree of coverage of K by c
![Page 54: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/54.jpg)
Belief Network Model (3/6)
• A query q is modeled as a network node
– This random variable is set to 1 whenever q completely covers the concept space K
– $P(q)$ computes the degree of coverage of the space K by q
• A document $d_j$ is modeled as a network node
– This random variable is 1 to indicate that $d_j$ completely covers the concept space K
– $P(d_j)$ computes the degree of coverage of the space K by $d_j$
![Page 55: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/55.jpg)
Belief Network Model (4/6)
![Page 56: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/56.jpg)
Belief Network Model (5/6)
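• Assumption
– $P(d_j|q)$ is adopted as the rank of the document $d_j$ with respect to the query q

$$P(d_j|q) = P(d_j \wedge q) / P(q)$$

$$\begin{aligned} P(d_j \wedge q) &= \sum_{u} P(d_j \wedge q | u) \times P(u) \\ &= \sum_{u} P(d_j|u) \times P(q|u) \times P(u) \\ &= \sum_{\vec{k}} P(d_j|\vec{k}) \times P(q|\vec{k}) \times P(\vec{k}) \end{aligned}$$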
![Page 57: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/57.jpg)
Belief Network Model (6/6)
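• Specify the conditional probabilities as follows, where $\vec{k_i}$ denotes the state in which $k_i$ is the only active index term

$$P(q|\vec{k}) = \begin{cases} \dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k_i} \wedge g_i(q) = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$P(d_j|\vec{k}) = \begin{cases} \dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k_i} \wedge g_i(d_j) = 1 \\ 0 & \text{otherwise} \end{cases}$$

• Thus, the belief network model can be tuned to subsume the vector model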
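A short Python sketch shows why these conditional probabilities reproduce the vector model: summing $P(d_j|\vec{k}) \times P(q|\vec{k})$ over the single-term states gives the cosine of the two weight vectors. The function names are hypothetical, and the constant prior $P(\vec{k})$ and the normalizer $P(q)$ are dropped since they do not change the ranking.

```python
import math

def unit(weights):
    """Divide each weight by the vector's Euclidean norm (zero vector stays zero)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

def belief_rank(d, q):
    """Sum over single-term states k_i of P(d_j | k_i) * P(q | k_i).
    States where either weight is zero contribute nothing."""
    return sum(p_d * p_q for p_d, p_q in zip(unit(d), unit(q)))

if __name__ == "__main__":
    print(belief_rank((1, 2, 4), (1, 2, 3)))
```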
![Page 58: IR Models J. H. Wang Mar. 11, 2008. The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.](https://reader035.fdocuments.in/reader035/viewer/2022062719/56649ed35503460f94be41d2/html5/thumbnails/58.jpg)
Comparison
• Belief network model
– Is based on a set-theoretic view
– It provides a separation between the document and the query
– It is able to reproduce any ranking strategy generated by the inference network model
• Inference network model
– Takes a purely epistemological view which is more difficult to grasp