Modern Information Retrieval, Week 3: Probabilistic Model
Last Time…
Boolean model
- Based on the notion of sets
- Documents are retrieved only if they satisfy the Boolean conditions specified in the query
- Does not impose a ranking on retrieved documents
- Exact match
Vector space model
- Based on geometry, the notion of vectors in high-dimensional space
- Documents are ranked based on their similarity to the query (ranked retrieval)
- Best/partial match
Probabilistic Model
Views retrieval as an attempt to answer a basic question:
"What is the probability that this document is relevant to this query?"
Expressed as: P(REL | D), i.e., the probability of relevance given a particular document D.
Assumptions
- "Document" here means the content representation or description, i.e., a surrogate
- Relevance is binary
- Relevance of a document is independent of the relevance of other documents
- Terms are independent of one another
Statistical Independence
A and B are independent if and only if: P(A and B) = P(A) P(B)
Simplest example: a series of coin flips. Independence formalizes "unrelated."
P("being brown-eyed") = 6/10
P("being a doctor") = 1/1000
P("being a brown-eyed doctor") = P("being brown-eyed") P("being a doctor") = 6/10,000
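The independence definition above can be checked in a few lines of Python, using the slide's illustrative probabilities (a sketch, not real population statistics):

```python
# Statistical independence: P(A and B) = P(A) * P(B).
# Numbers are the slide's illustrative marginals.
p_brown_eyed = 6 / 10
p_doctor = 1 / 1000

# If the two events are independent, the joint probability
# is simply the product of the marginals.
p_joint = p_brown_eyed * p_doctor
print(round(p_joint, 6))  # 0.0006, i.e., 6/10,000
```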
Dependent Events
Suppose:
P("having a B.S. degree") = 3/10
P("being a doctor") = 1/1000
Would you expect: P("having a B.S. degree and being a doctor") = P("having a B.S. degree") P("being a doctor") = 3/10,000?
Another example:
P("being a doctor") = 1/1000
P("having studied anatomy") = 12/1000
P("having studied anatomy" | "being a doctor") = ??
Conditional Probability
P(A | B) = P(A and B) / P(B)
(Figure: a Venn diagram of overlapping events A and B, with their intersection "A and B", inside the event space.)
P(A) = probability of A relative to the entire event space
P(A | B) = probability of A given that we know B is true
Doctors and Anatomy
P(A | B) = P(A and B) / P(B)
A = having studied anatomy; B = being a doctor
What is P("having studied anatomy" | "being a doctor")?
P("being a doctor") = 1/1000
P("having studied anatomy") = 12/1000
P("being a doctor who studied anatomy") = 1/1000
P("having studied anatomy" | "being a doctor") = (1/1000) / (1/1000) = 1
More on Conditional Probability
What if P(A | B) = P(A)? Then A and B must be statistically independent!
Is P(A | B) = P(B | A)? Not in general:
A = having studied anatomy; B = being a doctor
P("being a doctor") = 1/1000
P("having studied anatomy") = 12/1000
P("being a doctor who studied anatomy") = 1/1000
P("having studied anatomy" | "being a doctor") = 1
P("being a doctor" | "having studied anatomy") = (1/1000) / (12/1000) = 1/12
If you're a doctor, you must have studied anatomy…
If you've studied anatomy, you're more likely to be a doctor, but you could also be, for example, a biologist.
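The asymmetry between P(A | B) and P(B | A) can be seen directly by computing both from the slide's numbers:

```python
# Conditional probability: P(A | B) = P(A and B) / P(B),
# using the slide's doctors/anatomy numbers.
p_doctor = 1 / 1000
p_anatomy = 12 / 1000
p_both = 1 / 1000  # "being a doctor who studied anatomy"

p_anatomy_given_doctor = p_both / p_doctor   # = 1.0  (every doctor studied anatomy)
p_doctor_given_anatomy = p_both / p_anatomy  # = 1/12 (most anatomy students aren't doctors)
print(p_anatomy_given_doctor, round(p_doctor_given_anatomy, 4))
```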
Applying Bayes' Theorem
P("have disease") = 0.0001 (0.01%)
P("test positive" | "have disease") = 0.99 (99%)
P("test positive") = 0.010098
Two cases:
1. You have the disease, and you tested positive
2. You don't have the disease, but you tested positive (an error)
P(A | B) = P(A and B) / P(B) = P(B | A) × P(A) / P(B)
Don't worry!
P("have disease" | "test positive") = (0.99)(0.0001) / 0.010098 ≈ 0.009804 = 0.9804%
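The calculation above can be reproduced in Python. Note that the slide gives P("test positive") = 0.010098 directly; it is consistent with an (implied, unstated) false-positive rate of 1%, which the sketch below assumes in order to show where that number comes from:

```python
# Bayes' theorem on the disease-testing example.
p_disease = 0.0001              # P("have disease")
p_pos_given_disease = 0.99      # P("test positive" | "have disease")
p_pos_given_healthy = 0.01      # assumed false-positive rate (implied by the slide)

# Total probability of testing positive: the two cases on the slide.
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))
# = 0.000099 + 0.009999 = 0.010098, matching the slide

p_disease_given_positive = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_positive, 6))  # 0.009804, i.e., about 0.98%
```

Even with a 99%-accurate test, a positive result implies less than a 1% chance of disease, because the disease itself is so rare.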
Bayes' formula: let D1, D2, …, Dn be a partition of the sample space S, where P(Di) denotes the probability of event Di, with P(Di) > 0 (i = 1, 2, …, n). Then for any event x with P(x) > 0:
P(Di | x) = P(x | Di) P(Di) / Σj P(x | Dj) P(Dj)
Probabilistic Model
Objective: to capture the IR problem in a probabilistic framework
Given a user query, there is an ideal answer set
Querying is a specification of the properties of this ideal answer set (clustering)
But what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
Improve by iteration
Probabilistic Ranking Principle
Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations. The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
But how do we compute these probabilities? What is the sample space?
The Ranking
Probabilistic ranking is computed as:
sim(q, dj) = P(dj relevant to q) / P(dj non-relevant to q)
This is the odds of the document dj being relevant. Taking the odds minimizes the probability of an erroneous judgement.
The Ranking
Definitions (writing ¬R for "not relevant"):
wij ∈ {0, 1}
P(R | vec(dj)) : probability that the given doc is relevant
P(¬R | vec(dj)) : probability that the given doc is not relevant
sim(dj, q) = P(R | vec(dj)) / P(¬R | vec(dj))
           = [P(vec(dj) | R) × P(R)] / [P(vec(dj) | ¬R) × P(¬R)]
           ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
(the priors P(R) and P(¬R) are the same for every document, so they can be dropped for ranking purposes)
P(vec(dj) | R) : probability of randomly selecting the document dj from the set R of relevant documents
The Ranking
sim(dj, q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
           ~ [ Π(ki ∈ dj) P(ki | R) × Π(ki ∉ dj) P(¬ki | R) ] / [ Π(ki ∈ dj) P(ki | ¬R) × Π(ki ∉ dj) P(¬ki | ¬R) ]
where P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R)
P(ki | R) : probability that the index term ki is present in a document randomly selected from the set R of relevant documents
The Ranking: Example
Initial query: t1 t2 t4
P(ti | R) and P(ti | ¬R) are listed below:

Term       t1     t2     t3     t4     t5
P(ti|R)    0.8    0.9    0.3    0.32   0.15
P(ti|¬R)   0.3    0.1    0.35   0.33   0.10

Document D1 contains t2 and t5, so:
P(D1 | R)  = (1 - 0.8) × 0.9 × (1 - 0.3) × (1 - 0.32) × 0.15 ≈ 0.01285
P(D1 | ¬R) = (1 - 0.3) × 0.1 × (1 - 0.35) × (1 - 0.33) × 0.10 ≈ 0.00305
P(D1 | R) / P(D1 | ¬R) ≈ 4.216
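The example can be reproduced with a small sketch of the binary-independence likelihood product (variable names are my own):

```python
# Binary-independence ranking for the slide's example.
# Probabilities per term t1..t5; D1 is a binary incidence vector.
p_rel = [0.8, 0.9, 0.3, 0.32, 0.15]    # P(ti | R)
p_nrel = [0.3, 0.1, 0.35, 0.33, 0.10]  # P(ti | not-R)
d1 = [0, 1, 0, 0, 1]                   # D1 contains t2 and t5

def likelihood(doc, p):
    """Product over all terms: p_i if the term is present, (1 - p_i) if absent."""
    result = 1.0
    for present, p_i in zip(doc, p):
        result *= p_i if present else (1 - p_i)
    return result

odds = likelihood(d1, p_rel) / likelihood(d1, p_nrel)
print(round(odds, 3))  # 4.216, matching the slide
```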
The Initial Ranking
How do we get the probabilities P(ki | R) and P(ki | ¬R)?
Estimates based on assumptions:
P(ki | R) = 0.5
P(ki | ¬R) = ni / N
where ni is the number of docs that contain ki and N is the total number of docs.
Use this initial guess to retrieve an initial ranking, then improve upon it.
Improving the Initial Ranking
Let:
V : set of docs initially retrieved
Vi : subset of retrieved docs that contain ki
Re-evaluate the estimates P(ki | R) and P(ki | ¬R):
P(ki | R) = Vi / V
P(ki | ¬R) = (ni - Vi) / (N - V)
Repeat recursively.
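The initial-guess and update steps above can be sketched as two small functions (names and the example numbers are my own; V and Vi are used as set sizes):

```python
# Estimate-update cycle for the binary independence model.
# N: total docs; n_i: docs containing term ki;
# V: number of docs initially retrieved; V_i: retrieved docs containing ki.

def initial_estimates(n_i, N):
    # Initial assumptions: P(ki|R) = 0.5, P(ki|not-R) = n_i / N.
    return 0.5, n_i / N

def updated_estimates(V_i, V, n_i, N):
    # After inspecting the retrieved set, re-estimate both probabilities.
    p_rel = V_i / V
    p_nrel = (n_i - V_i) / (N - V)
    return p_rel, p_nrel

# Hypothetical numbers: N = 1000 docs, term in n_i = 100 of them;
# of V = 20 retrieved docs, V_i = 15 contain the term.
p_rel, p_nrel = updated_estimates(15, 20, 100, 1000)
print(p_rel, round(p_nrel, 4))  # 0.75 and 85/980 ≈ 0.0867
```

In practice the estimates are usually smoothed (e.g., adding 0.5 to the counts) to avoid zero probabilities, though the slides omit this.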
Probabilistic Model
An initial set of documents is retrieved somehow
The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
The IR system uses this information to refine its description of the ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve
Keep in mind the need to guess the description of the ideal answer set at the very beginning
The description of the ideal answer set is modeled in probabilistic terms
Pluses and Minuses
Advantages:
- Docs are ranked in decreasing order of their probability of relevance
Disadvantages:
- Need to guess initial estimates for P(ki | R)
- The method does not take into account tf and idf factors
Brief Comparison of Classic Models
Boolean model does not provide for partial matches and is considered to be the weakest classic model
Salton and Buckley conducted a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
This also seems to be the view of the research community
Comparison With Vector Space
Similar in some ways:
- Terms are treated as if they were independent (unigram language model)
Different in others:
- Based on probability rather than similarity
- Intuitions are probabilistic (processes for generating text) rather than geometric
- Details of the use of document length and term, document, and collection frequencies differ
What's the Point?
Probabilistic models formalize assumptions:
- Binary relevance
- Document independence
- Term independence
None of which are true! Relevance isn't binary, documents are often not independent, and terms are clearly not independent.
But it works!
Extended Boolean Model
Disadvantages of the Boolean model:
- No term weight is used
  Counterexample: for the query q = Kx AND Ky, a document containing just one of the terms, e.g. Kx, is considered as irrelevant as a document containing neither term.
- The size of the output might be too large or too small
Extended Boolean Model
The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu
The idea is to make use of term weights, as in the vector space model
Strategy: combine Boolean queries with the vector space model
Why not just use the vector space model? Boolean queries have an advantage: it is easy for the user to specify the query.
Fig.: Extended Boolean logic in the space composed of two terms kx and ky only. Documents dj and dj+1 are plotted in the unit square with corners (0,0), (1,0), (0,1), (1,1); the left panel shows "kx and ky", the right panel "kx or ky".
Extended Boolean Model
For the query q = Kx OR Ky, (0,0) is the point we try to avoid. Thus, we can use
sim_or(q, dj) = sqrt((x² + y²) / 2)
to rank the documents. The bigger the better.
Extended Boolean Model
For the query q = Kx AND Ky, (1,1) is the most desirable point. We use
sim_and(q, dj) = 1 - sqrt(((1 - x)² + (1 - y)²) / 2)
to rank the documents. The bigger the better.
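Both two-term similarity functions are straightforward to implement; the sketch below uses the formulas from the two slides above (x and y are the document's weights for kx and ky, assumed to lie in [0, 1]):

```python
import math

# Extended Boolean similarities for a two-term query.

def sim_or(x, y):
    # Distance from the worst point (0,0), normalized to [0, 1]:
    # the farther from "neither term matches", the better.
    return math.sqrt((x**2 + y**2) / 2)

def sim_and(x, y):
    # One minus the normalized distance from the ideal point (1,1).
    return 1 - math.sqrt(((1 - x)**2 + (1 - y)**2) / 2)

# A document matching exactly one term:
print(round(sim_or(1, 0), 4))   # 0.7071
print(round(sim_and(1, 0), 4))  # 0.2929
```

Unlike the classic Boolean model, a document matching only one term of an AND query gets a reduced but nonzero score instead of being rejected outright.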
Exercise
Term-document incidence matrix (terms: adventure, agriculture, bridge, cathedrals, disasters, flags, horticulture, leprosy, Mediterranean, recipes, scholarships, tennis, Venus):

             adv  agr  bri  cat  dis  fla  hor  lep  Med  rec  sch  ten  Ven
Document 1:   0    0    1    1    0    0    0    0    1    1    0    0    0
Document 2:   0    1    1    0    0    0    0    0    0    0    1    1    0
Document 3:   1    0    1    0    1    0    0    1    0    0    0    0    1
Document 4:   1    1    0    0    0    1    1    0    0    0    0    1    0

Query: bridge tennis
Exercise
Given the document corpus:
D1: 北京安立文高新技术公司
D2: 新一代的网络访问技术
D3: 北京卫星网络有限公司
D4: 是最先进的总线技术。。。
D5: 北京升平卫星技术有限公司的新技术有。。。
Using Chinese word-segmentation software, we obtain tokens separated by "/":
D1: 北京 / 安 / 立 / 文 / 高新 / 技术 / 公司 /
D2: 新 / 一 / 代 / 的 / 网络 / 访问 / 技术 /
D3: 北京 / 卫星 / 网络 / 有限 / 公司 /
D4: 是 / 最 / 先进 / 的 / 总线 / 技术 / 。。。
D5: 北京 / 升 / 平 / 卫星 / 技术 / 有限 / 公司 / 的 / 新 / 技术 / 有。。。
Your task is to design an information retrieval system for these documents. Specifically:
(1) Give the system's effective vocabulary (explain your inclusion/exclusion decisions).
(2) Write out the VSM representations of D1 and D2.
(3) Using the cosine formula for the angle between vectors, give the top 3 results for the query "技术的公司" ("technology company").