Modern Information Retrieval, Week 3: Probabilistic Model
Last Time…
Boolean model
- Based on the notion of sets
- Documents are retrieved only if they satisfy the Boolean conditions specified in the query
- Does not impose a ranking on retrieved documents
- Exact match
Vector space model
- Based on geometry, the notion of vectors in high-dimensional space
- Documents are ranked based on their similarity to the query (ranked retrieval)
- Best/partial match
Probabilistic Model
Views retrieval as an attempt to answer a basic question:
"What is the probability that this document is relevant to this query?"
Expressed as: P(REL | D), i.e., the probability of relevance given a particular document D.
Assumptions
- "Document" here means the content representation or description, i.e., a surrogate
- Relevance is binary
- Relevance of a document is independent of the relevance of other documents
- Terms are independent of one another
Statistical Independence
A and B are independent if and only if: P(A and B) = P(A) P(B)
Simplest example: a series of coin flips. Independence formalizes "unrelated."
P("being brown-eyed") = 6/10
P("being a doctor") = 1/1000
P("being a brown-eyed doctor") = P("being brown-eyed") P("being a doctor") = 6/10,000
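The independence definition above can be checked in a few lines of Python, using the slide's illustrative probabilities (a sketch, not real population statistics):

```python
# Statistical independence: P(A and B) = P(A) * P(B).
# Numbers are the slide's illustrative marginals.
p_brown_eyed = 6 / 10
p_doctor = 1 / 1000

# If the two events are independent, the joint probability
# is simply the product of the marginals.
p_joint = p_brown_eyed * p_doctor
print(round(p_joint, 6))  # 0.0006, i.e., 6/10,000
```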
Dependent Events
Suppose:
P("having a B.S. degree") = 3/10
P("being a doctor") = 1/1000
Would you expect: P("having a B.S. degree and being a doctor") = P("having a B.S. degree") P("being a doctor") = 3/10,000?
Another example:
P("being a doctor") = 1/1000
P("having studied anatomy") = 12/1000
P("having studied anatomy" | "being a doctor") = ??
Conditional Probability
P(A | B) = P(A and B) / P(B)
(Figure: a Venn diagram of overlapping events A and B, with their intersection "A and B", inside the event space.)
P(A) = probability of A relative to the entire event space
P(A | B) = probability of A given that we know B is true
Doctors and Anatomy
P(A | B) = P(A and B) / P(B)
A = having studied anatomy; B = being a doctor
What is P("having studied anatomy" | "being a doctor")?
P("being a doctor") = 1/1000
P("having studied anatomy") = 12/1000
P("being a doctor who studied anatomy") = 1/1000
P("having studied anatomy" | "being a doctor") = (1/1000) / (1/1000) = 1
More on Conditional Probability
What if P(A | B) = P(A)? Then A and B must be statistically independent!
Is P(A | B) = P(B | A)? Not in general:
A = having studied anatomy; B = being a doctor
P("being a doctor") = 1/1000
P("having studied anatomy") = 12/1000
P("being a doctor who studied anatomy") = 1/1000
P("having studied anatomy" | "being a doctor") = 1
P("being a doctor" | "having studied anatomy") = (1/1000) / (12/1000) = 1/12
If you're a doctor, you must have studied anatomy…
If you've studied anatomy, you're more likely to be a doctor, but you could also be, for example, a biologist.
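The asymmetry between P(A | B) and P(B | A) can be seen directly by computing both from the slide's numbers:

```python
# Conditional probability: P(A | B) = P(A and B) / P(B),
# using the slide's doctors/anatomy numbers.
p_doctor = 1 / 1000
p_anatomy = 12 / 1000
p_both = 1 / 1000  # "being a doctor who studied anatomy"

p_anatomy_given_doctor = p_both / p_doctor   # = 1.0  (every doctor studied anatomy)
p_doctor_given_anatomy = p_both / p_anatomy  # = 1/12 (most anatomy students aren't doctors)
print(p_anatomy_given_doctor, round(p_doctor_given_anatomy, 4))
```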
Applying Bayes' Theorem
P("have disease") = 0.0001 (0.01%)
P("test positive" | "have disease") = 0.99 (99%)
P("test positive") = 0.010098
Two cases:
1. You have the disease, and you tested positive
2. You don't have the disease, but you tested positive (an error)
P(A | B) = P(A and B) / P(B) = P(B | A) × P(A) / P(B)
Don't worry!
P("have disease" | "test positive") = (0.99)(0.0001) / 0.010098 ≈ 0.009804 = 0.9804%
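The calculation above can be reproduced in Python. Note that the slide gives P("test positive") = 0.010098 directly; it is consistent with an (implied, unstated) false-positive rate of 1%, which the sketch below assumes in order to show where that number comes from:

```python
# Bayes' theorem on the disease-testing example.
p_disease = 0.0001              # P("have disease")
p_pos_given_disease = 0.99      # P("test positive" | "have disease")
p_pos_given_healthy = 0.01      # assumed false-positive rate (implied by the slide)

# Total probability of testing positive: the two cases on the slide.
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))
# = 0.000099 + 0.009999 = 0.010098, matching the slide

p_disease_given_positive = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_positive, 6))  # 0.009804, i.e., about 0.98%
```

Even with a 99%-accurate test, a positive result implies less than a 1% chance of disease, because the disease itself is so rare.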
Bayes' formula: let D1, D2, …, Dn be a partition of the sample space S, where P(Di) denotes the probability of event Di, with P(Di) > 0 (i = 1, 2, …, n). Then for any event x with P(x) > 0:
P(Di | x) = P(x | Di) P(Di) / Σj P(x | Dj) P(Dj)
Probabilistic Model
Objective: to capture the IR problem in a probabilistic framework
Given a user query, there is an ideal answer set
Querying is a specification of the properties of this ideal answer set (clustering)
But what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
Improve by iteration
Probabilistic Ranking Principle
Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations. The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
But how do we compute these probabilities? What is the sample space?
The Ranking
Probabilistic ranking is computed as:
sim(q, dj) = P(dj relevant to q) / P(dj non-relevant to q)
This is the odds of the document dj being relevant. Taking the odds minimizes the probability of an erroneous judgement.
The Ranking
Definitions (writing ¬R for "not relevant"):
wij ∈ {0, 1}
P(R | vec(dj)) : probability that the given doc is relevant
P(¬R | vec(dj)) : probability that the given doc is not relevant
sim(dj, q) = P(R | vec(dj)) / P(¬R | vec(dj))
           = [P(vec(dj) | R) × P(R)] / [P(vec(dj) | ¬R) × P(¬R)]
           ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
(the priors P(R) and P(¬R) are the same for every document, so they can be dropped for ranking purposes)
P(vec(dj) | R) : probability of randomly selecting the document dj from the set R of relevant documents
The Ranking
sim(dj, q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
           ~ [ Π(ki ∈ dj) P(ki | R) × Π(ki ∉ dj) P(¬ki | R) ] / [ Π(ki ∈ dj) P(ki | ¬R) × Π(ki ∉ dj) P(¬ki | ¬R) ]
where P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R)
P(ki | R) : probability that the index term ki is present in a document randomly selected from the set R of relevant documents
The Ranking: Example
Initial query: t1 t2 t4
P(ti | R) and P(ti | ¬R) are listed below:

Term       t1     t2     t3     t4     t5
P(ti|R)    0.8    0.9    0.3    0.32   0.15
P(ti|¬R)   0.3    0.1    0.35   0.33   0.10

Document D1 contains t2 and t5, so:
P(D1 | R)  = (1 - 0.8) × 0.9 × (1 - 0.3) × (1 - 0.32) × 0.15 ≈ 0.01285
P(D1 | ¬R) = (1 - 0.3) × 0.1 × (1 - 0.35) × (1 - 0.33) × 0.10 ≈ 0.00305
P(D1 | R) / P(D1 | ¬R) ≈ 4.216
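The example can be reproduced with a small sketch of the binary-independence likelihood product (variable names are my own):

```python
# Binary-independence ranking for the slide's example.
# Probabilities per term t1..t5; D1 is a binary incidence vector.
p_rel = [0.8, 0.9, 0.3, 0.32, 0.15]    # P(ti | R)
p_nrel = [0.3, 0.1, 0.35, 0.33, 0.10]  # P(ti | not-R)
d1 = [0, 1, 0, 0, 1]                   # D1 contains t2 and t5

def likelihood(doc, p):
    """Product over all terms: p_i if the term is present, (1 - p_i) if absent."""
    result = 1.0
    for present, p_i in zip(doc, p):
        result *= p_i if present else (1 - p_i)
    return result

odds = likelihood(d1, p_rel) / likelihood(d1, p_nrel)
print(round(odds, 3))  # 4.216, matching the slide
```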
The Initial Ranking
How do we get the probabilities P(ki | R) and P(ki | ¬R)?
Estimates based on assumptions:
P(ki | R) = 0.5
P(ki | ¬R) = ni / N
where ni is the number of docs that contain ki and N is the total number of docs.
Use this initial guess to retrieve an initial ranking, then improve upon it.
Improving the Initial Ranking
Let:
V : set of docs initially retrieved
Vi : subset of retrieved docs that contain ki
Re-evaluate the estimates P(ki | R) and P(ki | ¬R):
P(ki | R) = Vi / V
P(ki | ¬R) = (ni - Vi) / (N - V)
Repeat recursively.
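The initial-guess and update steps above can be sketched as two small functions (names and the example numbers are my own; V and Vi are used as set sizes):

```python
# Estimate-update cycle for the binary independence model.
# N: total docs; n_i: docs containing term ki;
# V: number of docs initially retrieved; V_i: retrieved docs containing ki.

def initial_estimates(n_i, N):
    # Initial assumptions: P(ki|R) = 0.5, P(ki|not-R) = n_i / N.
    return 0.5, n_i / N

def updated_estimates(V_i, V, n_i, N):
    # After inspecting the retrieved set, re-estimate both probabilities.
    p_rel = V_i / V
    p_nrel = (n_i - V_i) / (N - V)
    return p_rel, p_nrel

# Hypothetical numbers: N = 1000 docs, term in n_i = 100 of them;
# of V = 20 retrieved docs, V_i = 15 contain the term.
p_rel, p_nrel = updated_estimates(15, 20, 100, 1000)
print(p_rel, round(p_nrel, 4))  # 0.75 and 85/980 ≈ 0.0867
```

In practice the estimates are usually smoothed (e.g., adding 0.5 to the counts) to avoid zero probabilities, though the slides omit this.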
Probabilistic Model
An initial set of documents is retrieved somehow
The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
The IR system uses this information to refine its description of the ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve
Keep in mind the need to guess the description of the ideal answer set at the very beginning
The description of the ideal answer set is modeled in probabilistic terms
Pluses and Minuses
Advantages:
- Docs are ranked in decreasing order of their probability of relevance
Disadvantages:
- Need to guess initial estimates for P(ki | R)
- The method does not take into account tf and idf factors
Brief Comparison of Classic Models
Boolean model does not provide for partial matches and is considered to be the weakest classic model
Salton and Buckley conducted a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
This also seems to be the view of the research community
Comparison With Vector Space
Similar in some ways:
- Terms are treated as if they were independent (unigram language model)
Different in others:
- Based on probability rather than similarity
- Intuitions are probabilistic (processes for generating text) rather than geometric
- Details of the use of document length and term, document, and collection frequencies differ
What's the Point?
Probabilistic models formalize assumptions:
- Binary relevance
- Document independence
- Term independence
None of which are true! Relevance isn't binary, documents are often not independent, and terms are clearly not independent.
But it works!
Extended Boolean Model
Disadvantages of the Boolean model:
- No term weight is used
  Counterexample: for the query q = Kx AND Ky, a document containing just one of the terms, e.g. Kx, is considered as irrelevant as a document containing neither term.
- The size of the output might be too large or too small
Extended Boolean Model
The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu
The idea is to make use of term weights, as in the vector space model
Strategy: combine Boolean queries with the vector space model
Why not just use the vector space model? Boolean queries have an advantage: it is easy for the user to specify the query.
Fig.: Extended Boolean logic in the space composed of two terms kx and ky only. Documents dj and dj+1 are plotted in the unit square with corners (0,0), (1,0), (0,1), (1,1); the left panel shows "kx and ky", the right panel "kx or ky".
Extended Boolean Model
For the query q = Kx OR Ky, (0,0) is the point we try to avoid. Thus, we can use
sim_or(q, dj) = sqrt((x² + y²) / 2)
to rank the documents. The bigger the better.
Extended Boolean Model
For the query q = Kx AND Ky, (1,1) is the most desirable point. We use
sim_and(q, dj) = 1 - sqrt(((1 - x)² + (1 - y)²) / 2)
to rank the documents. The bigger the better.
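Both two-term similarity functions are straightforward to implement; the sketch below uses the formulas from the two slides above (x and y are the document's weights for kx and ky, assumed to lie in [0, 1]):

```python
import math

# Extended Boolean similarities for a two-term query.

def sim_or(x, y):
    # Distance from the worst point (0,0), normalized to [0, 1]:
    # the farther from "neither term matches", the better.
    return math.sqrt((x**2 + y**2) / 2)

def sim_and(x, y):
    # One minus the normalized distance from the ideal point (1,1).
    return 1 - math.sqrt(((1 - x)**2 + (1 - y)**2) / 2)

# A document matching exactly one term:
print(round(sim_or(1, 0), 4))   # 0.7071
print(round(sim_and(1, 0), 4))  # 0.2929
```

Unlike the classic Boolean model, a document matching only one term of an AND query gets a reduced but nonzero score instead of being rejected outright.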
Exercise
Term-document incidence matrix (terms: adventure, agriculture, bridge, cathedrals, disasters, flags, horticulture, leprosy, Mediterranean, recipes, scholarships, tennis, Venus):

             adv  agr  bri  cat  dis  fla  hor  lep  Med  rec  sch  ten  Ven
Document 1:   0    0    1    1    0    0    0    0    1    1    0    0    0
Document 2:   0    1    1    0    0    0    0    0    0    0    1    1    0
Document 3:   1    0    1    0    1    0    0    1    0    0    0    0    1
Document 4:   1    1    0    0    0    1    1    0    0    0    0    1    0

Query: bridge tennis
Exercise
Given the document corpus:
D1: 北京安立文高新技术公司
D2: 新一代的网络访问技术
D3: 北京卫星网络有限公司
D4: 是最先进的总线技术。。。
D5: 北京升平卫星技术有限公司的新技术有。。。
Using Chinese word-segmentation software, we obtain tokens separated by "/":
D1: 北京 / 安 / 立 / 文 / 高新 / 技术 / 公司 /
D2: 新 / 一 / 代 / 的 / 网络 / 访问 / 技术 /
D3: 北京 / 卫星 / 网络 / 有限 / 公司 /
D4: 是 / 最 / 先进 / 的 / 总线 / 技术 / 。。。
D5: 北京 / 升 / 平 / 卫星 / 技术 / 有限 / 公司 / 的 / 新 / 技术 / 有。。。
Your task is to design an information retrieval system for these documents. Specifically:
(1) Give the system's effective vocabulary (explain your inclusion/exclusion decisions).
(2) Write out the VSM representations of D1 and D2.
(3) Using the cosine formula for the angle between vectors, give the top 3 results for the query "技术的公司" ("technology company").