IR MODELS BY Prof. TARJNI VYAS
Tarjni Vyas 1
Introduction
[Diagram: the Docs (DB) and the Information Need are each abstracted into Index Terms, giving a Doc representation and a Query; the match step produces a Ranked List of Docs.]
IR Models
• User Task
  • Retrieval: Adhoc, Filtering
  • Browsing
• Classic Models: boolean, vector, probabilistic
• Set Theoretic: Fuzzy, Extended Boolean
• Probabilistic: Inference Network, Belief Network
• Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks
• Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
Specifying an IR Model
• Structure Quadruple [D, Q, F, R(qi, dj)]
  • D = Representation of documents
  • Q = Representation of queries
  • F = Framework for modeling representations and their relationships
    • Standard language/algebra/impl. type for translation to provide semantics
    • Evaluation w.r.t. “direct” semantics through benchmarks
  • R = Ranking function that associates a real number with a query-doc pair
About index terms
• Each document is represented by a set of representative keywords or index terms
  • Index terms are meant to capture the document’s main themes or semantics.
• Usually, index terms are nouns because nouns have meaning by themselves.
• However, search engines assume that all words are index terms (full-text representation)
• T1 = “conference”
• T2 = “crime”
• Adjectives, adverbs, conjunctions, etc. are not useful as index terms.
Notations/Conventions
• ki is an index term
• dj is a document
• t is the total number of index terms
• K = (k1, k2, …, kt) is the set of all index terms
• wij >= 0 is the weight associated with (ki,dj)
  • wij = 0 if the term is not in the doc
• vec(dj) = (w1j, w2j, …, wtj) is the weight vector associated with the document dj
• gi(vec(dj)) = wij is the function which returns the weight associated with the pair (ki,dj)
The Boolean Model
• Simple model based on set theory
• Queries and documents specified as boolean expressions
  • precise semantics
• E.g., q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent. Thus, wij ∈ {0,1}
Example
• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  • Disjunctive Normal Form
• vec(qcc) = (1,1,0)
  • Conjunctive component
• Similar/Matching documents
  • md1 = [ka ka d e] => (1,0,0)
  • md2 = [ka kb kc] => (1,1,1)
• Unmatched documents
  • ud1 = [ka kc] => (1,0,1)
  • ud2 = [d] => (0,0,0)
Similarity/Matching function
sim(q,dj) = 1 if vec(dj) ∈ vec(qdnf)
sim(q,dj) = 0 otherwise
• Requires coercion for accuracy
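The matching function can be sketched in a few lines of Python. This is a minimal sketch, assuming the example query q = ka ∧ (kb ∨ ¬kc); the helper names (`query`, `dnf`, `doc_vector`, `sim`) are illustrative and do not come from the slides.

```python
from itertools import product

TERMS = ("ka", "kb", "kc")  # index terms, in vector order


def query(ka, kb, kc):
    """Assumed example query: q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)


def dnf(q):
    """Enumerate every binary term vector that satisfies q (its DNF)."""
    return {bits for bits in product((0, 1), repeat=len(TERMS))
            if q(*(bool(b) for b in bits))}


def doc_vector(tokens):
    """Binary weights: 1 if the index term occurs in the document, else 0."""
    return tuple(int(term in tokens) for term in TERMS)


def sim(tokens, q_dnf):
    """Boolean similarity: 1 if the document vector is in the DNF, else 0."""
    return int(doc_vector(tokens) in q_dnf)


Q_DNF = dnf(query)
EXAMPLES = [("md1", ["ka", "ka", "d", "e"]), ("md2", ["ka", "kb", "kc"]),
            ("ud1", ["ka", "kc"]), ("ud2", ["d"])]
for name, tokens in EXAMPLES:
    print(name, doc_vector(tokens), sim(tokens, Q_DNF))
```

Terms outside the query vocabulary (d, e) are simply ignored when the document vector is built, which is why md1 still matches.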
Venn Diagram
q = ka ∧ (kb ∨ ¬kc)
[Venn diagram of the sets Ka, Kb and Kc, with the regions (1,0,0), (1,1,0) and (1,1,1) marked.]
Drawback of Boolean model
• The expressive power of boolean expressions is inadequate to capture the information need and document semantics
• Retrieval based on binary decision criteria (with no partial match) does not adequately reflect our intuitions about relevance
• As a result
  • The answer set contains either too few or too many documents in response to a user query
  • No ranking of documents
Vector Model
• Task:
  • Document collection
  • Query specifies information need: free text
  • Relevance judgments: depend upon the weighting scheme for all docs
  • Word evidence: Bag of words
    • No ordering information
Vector Space Model
• Represent documents and queries as
  • Vectors of term-based features
  • Features: tied to occurrence of terms in collection
• E.g.

$$d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j}); \quad q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$$

• Solution 1: Binary features: t = 1 if present, 0 otherwise
  • Similarity: number of terms in common
  • Dot product

$$\mathrm{sim}(q_k, d_j) = \sum_{i=1}^{N} t_{i,k}\, t_{i,j}$$
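The Vector Model: Example I

|    | k1 | k2 | k3 | q · dj |
|----|----|----|----|--------|
| d1 | 1  | 0  | 1  | 2      |
| d2 | 1  | 0  | 0  | 1      |
| d3 | 0  | 1  | 1  | 2      |
| d4 | 1  | 0  | 0  | 1      |
| d5 | 1  | 1  | 1  | 3      |
| d6 | 1  | 1  | 0  | 2      |
| d7 | 0  | 1  | 0  | 1      |
| q  | 1  | 1  | 1  |        |

[Venn diagram: documents d1 to d7 placed in the regions of the sets k1, k2 and k3.]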
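With binary features the dot product simply counts the terms a document shares with the query. The following sketch recomputes the q · dj column of Example I; the names `dot`, `DOCS`, `QUERY` and `SCORES` are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


# Binary term vectors (k1, k2, k3) taken from Example I.
DOCS = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
QUERY = (1, 1, 1)

SCORES = {name: dot(QUERY, vec) for name, vec in DOCS.items()}
print(SCORES)
```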
Vector Space Model II
• Problem: Not all terms equally interesting
  • E.g. “accuracy” vs “crime”
• Solution: Replace binary term features with weights
  • Document collection: term-by-document matrix
  • View as vector in multidimensional space
    • Nearby vectors are related
  • Normalize for vector length

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j}); \quad q_k = (w_{1,k}, w_{2,k}, \ldots, w_{N,k})$$
Cosine similarity
[Figure: document vectors d1 and d2 in the space spanned by the term axes t1, t2 and t3, separated by the angle θ.]
• Distance between vectors d1 and d2 is captured by the cosine of the angle θ between them.
Queries in the vector space model
Central idea: the query as a vector:
• We regard the query as a short document
• Note that dq is very sparse!
• We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.
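$$\mathrm{sim}(d_j, d_q) = \frac{\vec{d_j} \cdot \vec{d_q}}{|\vec{d_j}|\,|\vec{d_q}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$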
Vector Similarity Computation
• Similarity = Dot product

$$\mathrm{sim}(q_k, d_j) = \vec{q_k} \cdot \vec{d_j} = \sum_{i=1}^{N} w_{i,k}\, w_{i,j}$$

• Normalization:
  • Normalize weights in advance
  • Normalize post-hoc

$$\mathrm{sim}(q_k, d_j) = \frac{\sum_{i=1}^{N} w_{i,k}\, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^{2}}\;\sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}}$$

• Cosine of angle between two vectors
• The denominator involves the lengths of the vectors.
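A small sketch of the post-hoc normalization: the dot product is divided by the two vector lengths, so only the angle between the vectors matters. The function names are illustrative, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math


def dot(u, v):
    """Dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Scaling a vector changes its length but not its direction.
print(cosine((1, 2, 3), (2, 4, 6)))
print(cosine((1, 0), (0, 1)))
```

Because the score depends only on direction, a long document that repeats the same term mix as a short one receives the same cosine score.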
Computation of weights wij and wiq
• How to compute the weights wij and wiq ?
  • quantification of intra-document content (similarity/semantic emphasis)
    • tf factor, the term frequency within a document
  • quantification of inter-document separation (dis-similarity/significant discriminant)
    • idf factor, the inverse document frequency
  • wij = tf(i,j) * idf(i)
Weighting scheme
• Let,
  • N be the total number of docs in the collection
  • ni be the number of docs which contain ki
  • freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by
  • f(i,j) = freq(i,j) / max(freq(l,j))
  • where the maximum is computed over all terms l which occur within the document dj
• The idf factor is computed as
  • idf(i) = log (N/ni)
  • the log makes the values of tf and idf comparable.
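The definitions above can be sketched as follows. The two-document collection is hypothetical, the function name `tf_idf_weights` is illustrative, and the base-10 logarithm is an assumption (the slide does not state the base).

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, with
    weight = (freq / max freq in the doc) * log10(N / n_i)."""
    n_docs = len(docs)
    # n_i: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values(), default=1)
        weights.append({
            term: (count / max_freq) * math.log10(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights


# Hypothetical toy collection of tokenized documents.
TOY_DOCS = [["a", "a", "b"], ["a", "c"]]
WEIGHTS = tf_idf_weights(TOY_DOCS)
print(WEIGHTS)
```

A term that occurs in every document (here "a") gets idf = log(1) = 0 and therefore weight 0, however often it occurs.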
Rules:
• WARNING: In a lot of IR literature, “frequency” is used to mean “count”
  • Thus term frequency in IR literature is used to mean the number of occurrences in a doc
  • Not divided by document length (which would actually make it a frequency)
Best weighting scheme
• The best term-weighting schemes use weights which are given by
  • wij = f(i,j) * log(N/ni)
  • the strategy is called a tf-idf weighting scheme
• For the query term weights, use
  • wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) * log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy for general collections.
  • It is also simple and fast to compute.
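The Vector Model: Example II

[Venn diagram: documents d1 to d7 placed in the regions of the sets k1, k2 and k3.]

|    | k1 | k2 | k3 | q · dj |
|----|----|----|----|--------|
| d1 | 1  | 0  | 1  | 4      |
| d2 | 1  | 0  | 0  | 1      |
| d3 | 0  | 1  | 1  | 5      |
| d4 | 1  | 0  | 0  | 1      |
| d5 | 1  | 1  | 1  | 6      |
| d6 | 1  | 1  | 0  | 3      |
| d7 | 0  | 1  | 0  | 2      |
| q  | 1  | 2  | 3  |        |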
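The Vector Model: Example III

[Venn diagram: documents d1 to d7 placed in the regions of the sets k1, k2 and k3.]

|    | k1 | k2 | k3 | q · dj |
|----|----|----|----|--------|
| d1 | 2  | 0  | 1  | 5      |
| d2 | 1  | 0  | 0  | 1      |
| d3 | 0  | 1  | 3  | 11     |
| d4 | 2  | 0  | 0  | 2      |
| d5 | 1  | 2  | 4  | 17     |
| d6 | 1  | 2  | 0  | 5      |
| d7 | 0  | 5  | 0  | 10     |
| q  | 1  | 2  | 3  |        |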
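Once both the query and the documents carry weights, the same dot product yields a ranking. The sketch below recomputes the q · dj column of Example III and sorts the documents by decreasing score; `rank` is an illustrative helper, and ties keep the order in which the documents are listed.

```python
def dot(u, v):
    """Dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank(query, docs):
    """Return (scores, ranking) with documents sorted by decreasing score."""
    scores = {name: dot(query, vec) for name, vec in docs.items()}
    ranking = sorted(scores, key=lambda name: -scores[name])  # stable for ties
    return scores, ranking


# Weighted term vectors (k1, k2, k3) taken from Example III.
DOCS_III = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0),
}
QUERY_III = (1, 2, 3)

SCORES_III, RANKING_III = rank(QUERY_III, DOCS_III)
print(SCORES_III)
print(RANKING_III)
```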
We now consider the query “best auto car insurance” on a fictitious collection with N = 1,000,000 documents where the document frequencies of auto, best, car and insurance are respectively 5000, 50000, 10000 and 1000.
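As a first step, the idf of each term follows directly from these document frequencies. A minimal sketch, assuming idf = log10(N / df) (the base-10 logarithm is an assumption; `DF` and `IDF` are illustrative names):

```python
import math

N_DOCS = 1_000_000
DF = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

# idf(t) = log10(N / df_t): the rarer the term, the larger its idf.
IDF = {term: math.log10(N_DOCS / df) for term, df in DF.items()}

for term, value in IDF.items():
    print(f"{term}: {value:.2f}")
```

The rarest term, insurance, receives the largest idf and the most common one, best, the smallest.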
Net score = 0 + 0 + 0.82 + 2.46 = 3.28
Example 1 (inverted index)
• Draw the inverted index that would be built for the following document collection.
• Doc 1 new home sales top forecasts
• Doc 2 home sales rise in july
• Doc 3 increase in home sales in july
• Doc 4 july new home sales rise
• Hint: i) arranging ii) sorting iii) merging
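The three hinted steps can be sketched in Python: collect (term, docID) pairs, sort them, and merge the pairs of each term into one postings list. This is one possible reading of the hint, not the only way to build the index; `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict

DOCS = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}


def build_inverted_index(docs):
    # i) arrange: one (term, doc_id) pair per distinct term occurrence
    pairs = {(term, doc_id) for doc_id, text in docs.items() for term in text.split()}
    # ii) sort by term, then by doc_id
    ordered = sorted(pairs)
    # iii) merge pairs that share a term into a single postings list
    index = defaultdict(list)
    for term, doc_id in ordered:
        index[term].append(doc_id)
    return dict(index)


INDEX = build_inverted_index(DOCS)
for term, postings in INDEX.items():
    print(term, "->", postings)
```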
Example 2 (Boolean model)
• Consider these documents:
• Doc 1 breakthrough drug for schizophrenia
• Doc 2 new schizophrenia drug
• Doc 3 new approach for treatment of schizophrenia
• Doc 4 new hopes for schizophrenia patients
• For this document collection, depict the Boolean model and give the returned results for these queries:
a. schizophrenia AND drug
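One way to answer an AND query is to intersect the sorted postings lists of the two terms. The sketch below does this for query (a); `build_postings` and `intersect` are illustrative helpers rather than part of the slides.

```python
SCHIZ_DOCS = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}


def build_postings(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            postings.setdefault(term, []).append(doc_id)
    return postings


def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists (term1 AND term2)."""
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


POSTINGS = build_postings(SCHIZ_DOCS)
RESULT_A = intersect(POSTINGS["schizophrenia"], POSTINGS["drug"])
print(RESULT_A)
```

Only Doc 1 and Doc 2 contain both terms, so they form the answer set; as in any Boolean retrieval, the two documents are returned unranked.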
Example 3 (Weighted zone scoring)
• Score according to the zone of the document.
• Consider the query shakespeare in a collection in which each document has three zones: author, title and body.
• The Boolean score function for a zone takes on the value 1 if the query term shakespeare is present in the zone, and zero otherwise.
• Weighted zone scoring in such a collection would require three weights g1, g2 and g3, respectively corresponding to the author, title and body zones.
• Suppose we set g1 = 0.2, g2 = 0.3 and g3 = 0.5 (so that the three weights add up to 1); this corresponds to an application in which a match in the author zone is least important to the overall score, the title zone somewhat more, and the body contributes even more.
• Thus if the term shakespeare were to appear in the title and body zones but not the author zone of a document, the score of this document would be 0.8.
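The score is a weighted sum of per-zone Boolean matches. A minimal sketch using the weights from the example; the document `SAMPLE_DOC` and the function name `weighted_zone_score` are hypothetical.

```python
ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}  # g1, g2, g3


def weighted_zone_score(term, doc, weights=ZONE_WEIGHTS):
    """Sum the weights of every zone whose text contains the query term."""
    term = term.lower()
    return sum(g for zone, g in weights.items()
               if term in doc.get(zone, "").lower().split())


# Hypothetical document: the term appears in the title and body, not the author.
SAMPLE_DOC = {
    "author": "anonymous",
    "title": "shakespeare in love",
    "body": "a film about the young shakespeare",
}

ZONE_SCORE = weighted_zone_score("shakespeare", SAMPLE_DOC)
print(round(ZONE_SCORE, 2))
```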
Example (vector model)
• Q : “gold silver truck”
• D1 : “shipment of gold damaged in a fire”
• D2 : “delivery of silver arrived in a silver truck”
• D3 : “Shipment of gold in a truck”
• Find the ranking of the documents using the vector space model.
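One way to work this exercise is sketched below. It assumes raw term counts for tf, idf = log10(N / df), lower-cased whitespace tokens, and the unnormalized dot product as the score; these choices are assumptions, and other weighting or normalization schemes would give different scores.

```python
import math
from collections import Counter

VSM_DOCS = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold in a truck",
}
VSM_QUERY = "gold silver truck"


def tokenize(text):
    return text.lower().split()


def idf_table(docs):
    """idf(t) = log10(N / df_t) over the document collection."""
    n_docs = len(docs)
    df = Counter(t for text in docs.values() for t in set(tokenize(text)))
    return {t: math.log10(n_docs / count) for t, count in df.items()}


def weights(text, idf):
    """tf-idf weights with raw counts as tf; unknown terms get weight 0."""
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(text)).items()}


def score(q_w, d_w):
    """Dot product over the terms of the query."""
    return sum(w * d_w.get(t, 0.0) for t, w in q_w.items())


IDF_VSM = idf_table(VSM_DOCS)
Q_W = weights(VSM_QUERY, IDF_VSM)
VSM_SCORES = {name: score(Q_W, weights(text, IDF_VSM)) for name, text in VSM_DOCS.items()}
VSM_RANKING = sorted(VSM_SCORES, key=VSM_SCORES.get, reverse=True)
print(VSM_SCORES)
print(VSM_RANKING)
```

Under these assumptions D2 ranks first, because silver occurs twice in it and appears in no other document.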
Algorithm for computing vector scores
• We now initiate the study of determining the K documents with the highest vector space scores for a query.
• Typically, we seek these K top documents ordered by decreasing score.
• For instance, many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
• Here we give the basic algorithm for this computation.
Algorithm for computing vector scores
• COSINESCORE(q)
    float Scores[N] = 0
    Initialize Length[N]
    for each query term t
    do calculate w_{t,q} and fetch postings list for t
       for each pair (d, tf_{t,d}) in postings list
       do Scores[d] += wf_{t,d} × w_{t,q}
    Read the array Length[d]
    for each d
    do Scores[d] = Scores[d] / Length[d]
    return Top K components of Scores[]
• The array Length holds the lengths (normalization factors) for each of the N documents
• The array Scores holds the scores for each of the documents.
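The pseudocode translates almost line by line into Python. In this sketch the postings are a dict from term to a list of (doc_id, tf) pairs, the document weight wf is taken to be the raw tf, and the query weight is log10(N / df); all three are assumptions chosen for illustration, as are the toy data and the name `cosine_score`.

```python
import heapq
import math


def cosine_score(query_terms, postings, lengths, n_docs, k):
    """Return the k highest-scoring (doc_id, score) pairs for the query."""
    scores = {}  # sparse version of Scores[N]: only documents that are touched
    for term in set(query_terms):
        plist = postings.get(term, [])
        if not plist:
            continue
        w_tq = math.log10(n_docs / len(plist))  # assumed query weight: idf
        for doc_id, tf in plist:
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * w_tq
    for doc_id in scores:
        scores[doc_id] /= lengths[doc_id]  # length normalization
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical toy index over a collection of 4 documents.
TOY_POSTINGS = {"car": [(0, 1), (1, 2)], "insurance": [(1, 1)]}
TOY_LENGTHS = {0: 1.0, 1: 2.0, 2: 1.0, 3: 1.0}

TOP = cosine_score(["car", "insurance"], TOY_POSTINGS, TOY_LENGTHS, n_docs=4, k=2)
print(TOP)
```

Using a heap to pick the top K avoids sorting every score when K is much smaller than the number of documents.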
Advantages and disadvantages
• Advantages:
  • term-weighting improves answer set quality
  • partial matching allows retrieval of docs that approximate the query conditions
  • cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
  • assumes independence of index terms; not clear that this is bad though
Why use probabilities ?
• Information Retrieval deals with Uncertain Information
• Probability theory seems to be the most natural way to quantify uncertainty
Goal
• Collection of Documents
• User issues a query
• A Set of documents needs to be returned
• Question: In what order to present documents to user ?
• Intuitively, we want the “best” document to be first, the second best second, etc.
• Need a formal way to judge the “goodness” of documents w.r.t. queries.
• Idea: Probability of relevance of the document w.r.t. query
Probability Ranking Principle
If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request ...
… where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose ...
… then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
W.S. Cooper
Probability theory
• For two events A and B, the joint event of both events occurring is described by the joint probability P(A, B).
• The conditional probability P(A|B) expresses the probability of event A given that event B occurred.
• The fundamental relationship between joint and conditional probabilities is given by the chain rule:

$$P(A, B) = P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$$
Let us remember Probability Theory
Let a, b be two events.
$$p(a|b)\,p(b) = p(a \cap b) = p(b|a)\,p(a)$$

$$p(a|b) = \frac{p(b|a)\,p(a)}{p(b)}$$

$$p(\bar{a}|b)\,p(b) = p(b|\bar{a})\,p(\bar{a})$$
Probability Ranking Principle
Let x be a document in the collection.
Let R represent relevance of a document w.r.t. a given (fixed) query and let NR represent non-relevance.

$$p(R|x) = \frac{p(x|R)\,p(R)}{p(x)} \qquad p(NR|x) = \frac{p(x|NR)\,p(NR)}{p(x)}$$

p(x|R), p(x|NR) - probability that if a relevant (non-relevant) document is retrieved, it is x.
Need to find p(R|x) - probability that a retrieved document x is relevant.
p(R), p(NR) - prior probability of retrieving a (non) relevant document
Probability Ranking Principle
$$p(R|x) = \frac{p(x|R)\,p(R)}{p(x)} \qquad p(NR|x) = \frac{p(x|NR)\,p(NR)}{p(x)}$$

Ranking Principle (Bayes’ Decision Rule):
If p(R|x) > p(NR|x) then x is relevant, otherwise x is not relevant
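A small numeric sketch of the decision rule. The likelihoods and the prior used here are made-up values for illustration; only the formulas come from the slide. Since p(NR) = 1 - p(R) and the two posteriors sum to 1, the rule p(R|x) > p(NR|x) is the same as p(R|x) > 0.5.

```python
def posterior_relevance(p_x_given_r, p_x_given_nr, p_r):
    """p(R|x) by Bayes' rule, with p(x) = p(x|R)p(R) + p(x|NR)p(NR)."""
    joint_r = p_x_given_r * p_r
    joint_nr = p_x_given_nr * (1.0 - p_r)
    p_x = joint_r + joint_nr
    if p_x == 0:
        raise ValueError("document x has zero probability under both classes")
    return joint_r / p_x


def is_relevant(p_x_given_r, p_x_given_nr, p_r):
    """Bayes' decision rule: relevant iff p(R|x) > p(NR|x)."""
    return posterior_relevance(p_x_given_r, p_x_given_nr, p_r) > 0.5


# Hypothetical numbers: p(x|R) = 0.6, p(x|NR) = 0.2, p(R) = 0.3.
print(posterior_relevance(0.6, 0.2, 0.3))
print(is_relevant(0.6, 0.2, 0.3))
```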
Some methods using some rules
ABOUT RELEVANT AND NON-RELEVANT DOCUMENTS
• I1: The distribution of terms in relevant documents is independent, and their distribution in all documents is independent.
  • The presence of one term in a document does not assure the presence of another term; terms are independent and random.
• I2: The distribution of terms in relevant documents is independent, and their distribution in non-relevant documents is independent.
  • Satisfies I1.
• Query = “A B C”
  • The presence of A does not assure the presence or absence of B in the documents.
Methods and assumptions
• O1: Probable relevance is based on the presence of search terms in the document.
  This says that a document should be ranked only when some terms match in the document.
  Evidence must be found.
• O2: Probable relevance is based on both the presence of search terms in the document and their absence from the document.
  O2 doesn’t mean that we don’t know anything; it just means that we have some evidence for non-relevance.
  So O1 and O2 together mean that we should consider the presence and absence of all search terms in the query.
Combination of the methods using probability
• N = number of documents in the collection
• R = number of relevant documents for a given query q
• n = number of documents indexed by a given term t
• r = number of relevant documents indexed by the given term t
• Choosing I1 and O1 gives the following weight
  • W1 = log( (r/R) / (n/N) )
  • r/R = fraction of the relevant documents that contain the term
  • n/N = fraction of all documents that contain the term; if it is too large, it decreases the overall weight W1.
Combine method
• Combine I2 and O1
  • W2 = log( (r/R) / ((n-r) / (N-R)) )
  • (n-r) = total docs indexed by the term minus relevant docs indexed by the term = non-relevant docs for the term
  • (N-R) = total docs minus number of relevant docs = total non-relevant docs for the query
• Combine I1 and O2
  • W3 = log( (r/(R-r)) / (n/(N-n)) )
  • (N-n) = total docs minus docs indexed by the term = docs not indexed by the term; higher is better
  • (R-r) = relevant docs for the query minus relevant docs indexed by the term; lower is better
• Combine I2 and O2
  • W4 = log( (r/(R-r)) / ((n-r) / ((N-n)-(R-r))) )
  • (R-r) = relevant docs for the query minus relevant docs indexed by the term
Example for weight calculation
• Q : “gold silver truck”
• D1 : “shipment of gold damaged in a fire”
• D2 : “delivery of silver arrived in a silver truck”
• D3 : “Shipment of gold in a truck”
• Use the probabilistic model to find the appropriate rank of each document.
• Since we are using this procedure in a predictive manner, Robertson and Sparck Jones recommended adding a constant to each quantity
• Add the constants 0.5, 1, 1 and 2 to r, R, n and N respectively and then calculate W1.
Example
Apply the formulas to find W1, W2, W3 and W4 for each term and document.

|   | gold | silver | truck |
|---|------|--------|-------|
| N | 3    | 3      | 3     |
| n | 2    | 1      | 2     |
| R | 2    | 2      | 2     |
| r | 1    | 1      | 2     |

• N = number of documents in the collection
• R = number of relevant documents for a given query q
• n = number of documents indexed by a given term t
• r = number of relevant documents indexed by the given term t
Modified weight formulas
• W1 = log( [(r+0.5)/(R+1)] / [(n+1)/(N+2)] )
• W2 = log( [(r+0.5)/(R+1)] / [(n-r+0.5)/(N-R+1)] )
• W3 = log( [(r+0.5)/(R-r+0.5)] / [(n+1)/(N-n+1)] )
• W4 = log( [(r+0.5)/(R-r+0.5)] / [(n-r+0.5)/(N-n-(R-r)+0.5)] )
Term weights and document weights
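| Doc weights | W1     | W2     | W3     | W4     |
|-------------|--------|--------|--------|--------|
| D1          | -0.079 | -0.176 | -0.176 | -0.477 |
| D2          | 0.240  | 0.824  | 0.699  | 1.653  |
| D3          | 0.064  | 0.347  | 0.347  | 0.699  |

| Term weights | W1     | W2     | W3     | W4     |
|--------------|--------|--------|--------|--------|
| GOLD         | -0.079 | -0.176 | -0.176 | -0.477 |
| SILVER       | 0.097  | 0.301  | 0.176  | 0.477  |
| TRUCK        | 0.143  | 0.523  | 0.523  | 1.176  |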
Language model
• A traditional GENERATIVE MODEL of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings.
• The full set of strings that can be generated is called the language of the automaton.
Language model for query
[Figure: finite automaton with a starting state and an accepting state; transition labels: hot, dog, restaurant, in, city]
Example
Suppose, now, that we have two language models, M1 and M2, shown below. Find the probability that a model assigns to the sequence “frog said that toad likes frog”, given that the probability of continuing after a word is 0.8 and the probability of stopping is 0.2.
Example
• To find the probability of a word sequence, we multiply the probabilities that the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word.
• P(frog said that toad likes frog) = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
• ≈ 0.000000000001573
• The first group of numbers gives the term emission probabilities; the second gives the probability of continuing after each of the first five words and of stopping after the sixth.
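The arithmetic can be checked with a few lines of Python. This is a minimal sketch: the emission probabilities are the ones listed above, the function name `sequence_probability` is illustrative, and the model is assumed to continue after each of the first five words and stop after the sixth, which is the reading that yields the stated value of about 1.573 × 10^-12.

```python
import math

# Per-word emission probabilities for "frog said that toad likes frog",
# as listed in the example above.
emissions = [0.01, 0.03, 0.04, 0.01, 0.02, 0.01]
p_continue, p_stop = 0.8, 0.2

def sequence_probability(emissions, p_continue, p_stop):
    """Multiply emission probabilities with the continue/stop probabilities.

    Assumes one 'continue' after every word except the last, then one 'stop'.
    """
    n_words = len(emissions)
    return math.prod(emissions) * p_continue ** (n_words - 1) * p_stop

p = sequence_probability(emissions, p_continue, p_stop)
print(f"{p:.4e}")  # 1.5729e-12
```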
Types of language models
• The simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model.
$P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)P(t_2)P(t_3)P(t_4)$
• Such a model places a probability distribution over any sequence of words.
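A unigram model can be sketched in a few lines. The example below estimates each term probability as a relative frequency in a toy corpus; both the estimation method (maximum likelihood) and the corpus are assumptions made for illustration, and the names `train_unigram` and `p_uni` are hypothetical.

```python
from collections import Counter
import math

def train_unigram(tokens):
    """Estimate P(t) as the relative frequency of t (an assumed estimator)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def p_uni(sequence, model):
    """Product of independent term probabilities; unseen terms get probability 0."""
    return math.prod(model.get(term, 0.0) for term in sequence)

# Hypothetical toy corpus.
corpus = "frog said that toad likes frog".split()
model = train_unigram(corpus)

print(model["frog"])                            # 2 of 6 tokens
print(p_uni(["frog", "likes", "toad"], model))  # (2/6) * (1/6) * (1/6)
```

Because the terms are treated independently, any reordering of the same words receives the same probability under this model.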
Other types of models
• There are many more complex kinds of language models, such as the bigram model, which conditions on the previous term:
$P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)P(t_2|t_1)P(t_3|t_2)P(t_4|t_3)$
• In general, the chain rule decomposes the probability of a sequence of events into the probability of each successive event conditioned on earlier events:
$P(t_1 t_2 t_3 t_4) = P(t_1)P(t_2|t_1)P(t_3|t_1 t_2)P(t_4|t_1 t_2 t_3)$
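The bigram factorization can be sketched as follows. The conditional probabilities are estimated from bigram counts in a toy corpus; the estimator, the corpus, and the names `train_bigram` and `p_bi` are assumptions for illustration, not part of the slides.

```python
from collections import Counter

def train_bigram(tokens):
    """Collect the counts needed for relative-frequency bigram estimates."""
    unigram = Counter(tokens)                   # used for P(t1)
    history = Counter(tokens[:-1])              # how often each term has a successor
    bigram = Counter(zip(tokens, tokens[1:]))   # counts of (previous, current) pairs
    return unigram, history, bigram, len(tokens)

def p_bi(sequence, unigram, history, bigram, total):
    """P(t1) * product of P(t_i | t_{i-1}); returns 0.0 for unseen events."""
    if not sequence or unigram[sequence[0]] == 0:
        return 0.0
    p = unigram[sequence[0]] / total
    for prev, cur in zip(sequence, sequence[1:]):
        if history[prev] == 0:
            return 0.0
        p *= bigram[(prev, cur)] / history[prev]
    return p

# Hypothetical toy corpus.
corpus = "the frog said that the toad likes the frog".split()
unigram, history, bigram, total = train_bigram(corpus)

# P(the) * P(toad | the) * P(likes | toad) = (3/9) * (1/3) * (1/1)
print(p_bi(["the", "toad", "likes"], unigram, history, bigram, total))
```

Unlike the unigram model, word order matters here: a sequence containing a word pair never seen in the corpus receives probability zero under this unsmoothed estimate.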
Exercise
• Can you find examples of each type of language model?
  • Unigram model
  • Bigram model
  • Successive model
• Read the following paper and prepare a report:
• Relevant document retrieval via discrete stochastic optimization
• URL: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6716603&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6716603
Exercise
• An Adaptation of the Vector-Space Model for Ontology-Based Information Retrieval
• URL: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4039288&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4039288