
Lecture 2: IR Models

Johan Bollen
Old Dominion University
Department of Computer Science
jbollen@cs.odu.edu
http://www.cs.odu.edu/~jbollen

January 30, 2003

Structure

1. IR formal characterization

(a) Mathematical framework

(b) Document and keyterm representations

2. Models

(a) Boolean Model

(b) Vector Space Model

(c) Probabilistic Model


Formal Characterization of IR systems

An information retrieval model is defined as the quadruple

[D, Q, F, R(q_i, d_j)]

where

1. D is a set of document representations (logical views)

2. Q is a set of representations of the user's information needs, i.e. queries

3. F is the framework for modeling document representations, queries and their relationships

4. R(q_i, d_j) is a ranking function which associates a real number with a query q_i ∈ Q and a document representation d_j ∈ D.


Formal Characterization of IR systems


Classic IR concepts

1. Concepts:

(a) Documents are represented by index terms

(b) Index term weights express how specific a term is to a particular document

(c) The same may apply to the query

2. Relevant to the Boolean, vector and probabilistic models

k_i ∈ K = {k_1, ..., k_t}: generic index term

d_j: document

w_{i,j} > 0: weight associated with the pair (k_i, d_j)

If k_i does not occur in d_j, then w_{i,j} = 0

d_j is thus associated with the index term vector d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})

The function g_i returns the weight of index term k_i in d_j, such that g_i(d_j) = w_{i,j}


Example

Edgar Allan Poe's The Conqueror Worm, 1843

Last half of the last stanza:

1) And the angels, all pallid and wan,
2) Uprising, unveiling, affirm
3) That the play is the tragedy, "Man,"
4) And its hero the Conqueror Worm.

Hand-selected keyterms:

1) affirm, 2) angel, 3) conqueror, 4) hero, 5) man, 6) pallid,
7) play, 8) tragedy, 9) unveil, 10) uprise, 11) wan, 12) worm

g_5(d_3) = 1, g_3(d_2) = 0


Matrix representation

General form (rows = documents, columns = keyterms):

w_{1,1} w_{2,1} · · · w_{t,1}
w_{1,2} w_{2,2} · · · w_{t,2}
· · ·
w_{1,n} w_{2,n} · · · w_{t,n}

For the Conqueror Worm example:

0 1 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 1 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1

d_3 = (0,0,0,0,1,0,1,1,0,0,0,0),  k_4 = (0,0,0,1)

1. Keyterm-document, or in this case document-keyterm, matrix

2. Represents the weight values between keyterms and documents

3. Usually sparse, and very large

4. Assumption of independence or orthogonality: keyterms are not related!

5. In this specific case: weight values are binary
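As an illustration (a minimal Perl sketch, not part of the original slides), a binary matrix like the one above can be generated directly from the hand-selected keyterms; a crude prefix match stands in for proper stemming, so the stem "upris" is used to catch "uprising".

#!/usr/bin/perl
# Build the binary document-keyterm matrix for the Conqueror Worm example:
# entry is 1 if the keyterm (stem) occurs in the line, 0 otherwise.
use strict;
use warnings;

my @keyterms = qw(affirm angel conqueror hero man pallid
                  play tragedy unveil upris wan worm);   # "upris" so that "uprising" matches

my @documents = (
    "and the angels all pallid and wan",
    "uprising unveiling affirm",
    "that the play is the tragedy man",
    "and its hero the conqueror worm",
);

for my $j (0 .. $#documents) {
    my @row = map { $documents[$j] =~ /\b$_/ ? 1 : 0 } @keyterms;
    print "@row\n";
}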


What are IR models?

1. Abstract models of IR techniques:

(a) Characteristics
(b) Assumptions
(c) General modus operandi
(d) Properties

2. Do not comprise full-blown IR systems

(a) Technical details will be discussed later
(b) Focus is on assumptions and characteristics

3. The present discussion is not the be-all and end-all of IR

(a) New models can be introduced
(b) The devil is often in the details
(c) Large collections require specific engineering efforts


Boolean Model

1. Basic principles:

(a) The user query is expressed as a Boolean expression
(b) Documents are predicted to be either relevant or not

2. Advantages:

(a) Neat formalism
(b) Easy and efficient implementation
(c) Widespread adoption has accustomed users to Boolean queries

3. Disadvantages:

(a) No document ranking
(b) More a data retrieval than an information retrieval model
(c) Despite users' preference for them, it is not straightforward to formulate adequate queries


Boolean Model: Formalization

w_{i,j} ∈ {0,1}

A query q consists of keyterms connected by NOT, AND and OR.

For example, q = [k_a ∧ (k_b ∨ ¬k_c)]
or: q = k_a && (k_b || !k_c)
or: "brothers" AND ("chemical" OR NOT "doobie")

For every Boolean statement there exists a Disjunctive Normal Form (DNF):

q = [k_a ∧ (k_b ∨ ¬k_c)] = (1,0,0) ∨ (1,1,0) ∨ (1,1,1)

k_a  k_b  k_c | k_a ∧ (k_b ∨ ¬k_c) | conjunctive component
 0    0    0  |        0           |
 0    0    1  |        0           |
 0    1    0  |        0           |
 0    1    1  |        0           |
 1    0    0  |        1           | k_a ∧ ¬k_b ∧ ¬k_c
 1    0    1  |        0           |
 1    1    0  |        1           | k_a ∧ k_b ∧ ¬k_c
 1    1    1  |        1           | k_a ∧ k_b ∧ k_c


DNF

[Venn diagram over K_a, K_b and K_c showing the three conjunctive components:
K_a AND NOT K_b AND NOT K_c, K_a AND K_b AND NOT K_c, K_a AND K_b AND K_c]


Boolean Model: Query matching

Let q_dnf be the DNF of q and let the q_cc be the conjunctive components of q_dnf. Then

sim(d_j, q) = 1  if ∃ q_cc ∈ q_dnf such that ∀ k_i: g_i(d_j) = g_i(q_cc)
sim(d_j, q) = 0  otherwise

In human language: the similarity between a document and a user query is one when there exists a conjunctive component of the query's DNF in which every keyterm has the same value (1 = present) as it has in the document.
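A small Perl sketch of this matching rule (assumed here, not given on the slides), using the query of the example on the next slide: a document matches when its binary pattern over the query terms equals one of the conjunctive components of the DNF.

#!/usr/bin/perl
# Boolean retrieval: "play" AND ("tragedy" OR NOT "worm") matches a document
# if its (play, tragedy, worm) pattern is one of the DNF components 100, 110, 111.
use strict;
use warnings;

my @dnf = ("100", "110", "111");

# binary weights for (play, tragedy, worm) taken from the example matrix
my %docs = (
    d1 => [0, 0, 0],
    d2 => [0, 0, 0],
    d3 => [1, 1, 0],
    d4 => [0, 0, 1],
);

for my $d (sort keys %docs) {
    my $pattern = join("", @{ $docs{$d} });
    my $sim = (grep { $_ eq $pattern } @dnf) ? 1 : 0;
    print "sim($d, q) = $sim\n";
}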


Boolean Model: Example

Query: "play" AND ("tragedy" OR NOT "worm")

translates to DNF:

("play" AND NOT "tragedy" AND NOT "worm") = (100)
OR ("play" AND "tragedy" AND NOT "worm") = (110)
OR ("play" AND "tragedy" AND "worm") = (111)

Keyterms: 1) affirm, 2) angel, 3) conqueror, 4) hero, 5) man, 6) pallid,
7) play, 8) tragedy, 9) unveil, 10) uprise, 11) wan, 12) worm

0 1 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 1 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1

d_3 = "That the play is the tragedy, 'Man,'" matches the component (110).


Boolean Model: Discussion

1. Very popular because of its simplicity and efficiency

2. Does not rank results

3. Strongly dependent on the exact use of keyterms

4. Not really IR, but rather data retrieval

(a) Keyterms are semantically "shallow" handles on files
(b) User queries are intended to match completely:

i. problems with jargon
ii. synonymy
iii. the user has to be aware of the exact keyterms
iv. humans are not very good at formulating complicated Boolean queries


Vector Space Model

1. Keyterm-based model

2. Allows partial matching and ranking of documents

3. Basic principles:

(a) Document represented by a keyterm vector
(b) t-dimensional space spanned by the set of keyterms
(c) Query represented by a keyterm vector
(d) Document-query similarity calculated from vector distance

4. Requires:

(a) Keyterm weighting for document vectors
(b) Keyterm weighting for the query
(c) A distance measure for document and query vectors

5. Features:

(a) Efficient
(b) Successful
(c) Easy-to-grasp visual representation
(d) Applicable to document-document matching


Vector Space Model

Definition:

Let

w_{i,j} be the weight associated with (k_i, d_j), w_{i,j} ∈ R+
w_{i,q} be the weight associated with (k_i, q), w_{i,q} ∈ R+

Query vector q = (w_{1,q}, w_{2,q}, ..., w_{t,q}), where t is the total number of index terms in the system.

Document d_j is represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).

Both document d_j and query q are thus represented as t-dimensional vectors.

The similarity between document and query is evaluated as the cosine of their angle:

sim(d_j, q) = (d_j · q) / (||d_j|| × ||q||)
            = Σ_{i=1..t} w_{i,j} × w_{i,q} / ( sqrt(Σ_{i=1..t} w_{i,j}²) × sqrt(Σ_{i=1..t} w_{i,q}²) )

Based on the definition of the dot product:

d_j · q = ||d_j|| ||q|| cos(α)

The cosine similarity lies in [0,1] since w_{i,j} ≥ 0 and w_{i,q} ≥ 0.
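A minimal Perl sketch of the cosine measure (assumed, not part of the slides); the two vectors below are the d_5 document vector and the query vector from the Raven example later in this lecture.

#!/usr/bin/perl
# Cosine similarity sim(d, q) = (d . q) / (||d|| * ||q||); returns 0 for zero-norm vectors.
use strict;
use warnings;

sub cosine {
    my ($d, $q) = @_;
    my ($dot, $nd, $nq) = (0, 0, 0);
    for my $i (0 .. $#$d) {
        $dot += $d->[$i] * $q->[$i];
        $nd  += $d->[$i] ** 2;
        $nq  += $q->[$i] ** 2;
    }
    return 0 unless $nd && $nq;
    return $dot / (sqrt($nd) * sqrt($nq));
}

my @d5 = (0.48, 0.48, 0, 0, 0, 0, 0.78, 0);
my @q  = (0, 0.119, 0, 0, 0, 0, 0.195, 0);
printf "sim(d5, q) = %.3f\n", cosine(\@d5, \@q);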


Cosine similarity measure

[Figure: query vector Q and document vectors d_1 and d_2 in the keyterm space spanned by k_1, k_2 and k_3, with angles α and β between the query and the document vectors.]

cos(α) = 0: perpendicular vectors; cos(α) = 1: vectors are co-linear


Mommy, where do the index term weights come from?

1. The goal is to find a weight that expresses how specific a term is to the intended set of relevant documents

2. Use of frequency

3. Two issues:

(a) term frequency (TF): frequency within a document
(b) document frequency (DF): frequency within the collection

4. Term frequency

(a) Expresses the connection between a term and a specific document
(b) A term that occurs often within a document is indicative of its meaning

5. Document frequency

(a) Expresses the general frequency of a term, e.g. "the"
(b) Characteristic of the language

6. The aim is to find keyterms which balance high TF with relatively low DF


Definition

Term frequency

Let N be the total number of documents in the collection
and n_i the number of documents in which index term k_i appears.
freq_{i,j} is the raw frequency of k_i in d_j.

The normalized term frequency is then defined as:

f_{i,j} = freq_{i,j} / max_l freq_{l,j}

where max_l freq_{l,j} is taken over all terms in document d_j.

Inverse document frequency

idf_i = log(N / n_i)

Index term weight

w_{i,j} = f_{i,j} × log(N / n_i)
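A minimal Perl sketch of these definitions (assumed, not part of the slides), using base-10 logarithms as in the worked example that follows; the raw frequencies and document frequencies below are those of the Raven example.

#!/usr/bin/perl
# TF-IDF weighting: w(i,j) = ( freq(i,j) / max_l freq(l,j) ) * log10( N / n(i) )
use strict;
use warnings;

sub log10 { return log($_[0]) / log(10); }

my %freq = (                       # raw term frequencies per document
    d4 => { chamber => 1, door => 1 },
    d5 => { chamber => 1, door => 1, visiter => 1 },
);
my %df = (chamber => 2, door => 2, visiter => 1);   # n(i): document frequencies
my $N  = 6;                                         # N: documents in the collection

for my $d (sort keys %freq) {
    my $max = 0;
    for my $f (values %{ $freq{$d} }) { $max = $f if $f > $max; }
    for my $term (sort keys %{ $freq{$d} }) {
        my $w = ($freq{$d}{$term} / $max) * log10($N / $df{$term});
        printf "w(%s, %s) = %.3f\n", $term, $d, $w;
    }
}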


Query keyterm weights

Balance DF and query frequency:

w_{i,q} = ( 0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q} ) × log(N / n_i)

where freq_{i,q} is the frequency of the term in the query (usually 1).

Document-keyterm matrix:

     k_1      k_2      · · ·  k_10
d_1  w_{1,1}  w_{2,1}  · · ·  w_{10,1}
d_2  w_{1,2}  w_{2,2}  · · ·  w_{10,2}
d_3  w_{1,3}  w_{2,3}  · · ·  w_{10,3}
d_4  w_{1,4}  w_{2,4}  · · ·  w_{10,4}


TFIDF:


Example

Edgar Allan Poe's The Raven (1845)

First stanza:

1) Once upon a midnight dreary, while I pondered, weak and weary,
2) Over many a quaint and curious volume of forgotten lore,
3) While I nodded, nearly napping, suddenly there came a tapping,
4) As of some one gently rapping, rapping at my chamber door.
5) "'Tis some visiter," I muttered, "tapping at my chamber door -
6) Only this, and nothing more."

Hand-selected keyterms with document frequency:

1) chamber-2, 2) door-2, 3) lore-1, 4) midnight-1,
5) nothing-1, 6) tap-1, 7) visiter-1, 8) volume-1

Document-keyterm matrix: a 6×8 matrix whose entries w_{i,j} ∈ R+


Example: Document-Keyterm Matrix

     chamber  door     lore  midnight  nothing  tap  visiter  volume
d_1  w_{1,1}  w_{2,1}  · · ·  w_{8,1}
d_2  w_{1,2}  w_{2,2}  · · ·  w_{8,2}
d_3  w_{1,3}  w_{2,3}  · · ·  w_{8,3}
d_4  w_{1,4}  w_{2,4}  · · ·  w_{8,4}
d_5  w_{1,5}  w_{2,5}  · · ·  w_{8,5}
d_6  w_{1,6}  w_{2,6}  · · ·  w_{8,6}

k_1 = "chamber". Its frequency is zero except for d_4 and d_5.

w_{1,4} = f_{i,j} × log(N / n_i)

or

w_{1,4} = ( freq_{i,j} / max_l freq_{l,j} ) × log(N / n_i)

With N = 6, n_1 = 2, freq_{i,j} = 1, max_l freq_{l,j} = 1:

w_{1,4} = (1/1) × log10(6/2) ≈ 0.48


Example: Document-Keyterm Matrix

Result:

     chamber  door  lore  midnight  nothing  tap   visiter  volume
d1   0        0     0     0.78      0        0     0        0
d2   0        0     0.78  0         0        0     0        0.78
d3   0        0     0     0         0        0.78  0        0
d4   0.48     0.48  0     0         0        0     0        0
d5   0.48     0.48  0     0         0        0     0.78     0
d6   0        0     0     0         0.78     0     0        0

Note: the weight values for "chamber" and "door" are lower because these terms are less specific to a single document.


Example: Query

Query: "Visitor at your door"

Translation to known keyterms: { "visiter", "door" }

Query weights:

w_{i,q} = ( 0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q} ) × log(N / n_i)

"visiter": w_{7,q} = (0.5 + 0.5 × 1/1) × log(6/1) = 0.195

"door":    w_{2,q} = (0.5 + 0.5 × 1/1) × log(6/2) = 0.119

Resulting query vector: (0, 0.119, 0, 0, 0, 0, 0.195, 0)


Retrieval

Query-document similarity:

sim(d_j, q) = (d_j · q) / (||d_j|| × ||q||)

Procedure:

1. Calculate the numerator (the inner product of q with each document vector) for all documents

(a) Weight matrix times query vector = vector of inner products
(b) Store the resulting vector

2. Calculate the denominator (the norms of q and of the document vectors)

3. Determine the similarities

(a) Normalize the vector of inner products by the vector norms
(b) Rank documents according to the values in the resulting vector
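The procedure above as a Perl sketch (assumed, not part of the slides), with the weight matrix and query vector of the example: compute the inner products, normalize by the norms, and sort.

#!/usr/bin/perl
# Rank documents by cosine similarity to the query (Raven example data).
use strict;
use warnings;

my @M = (
    [0,    0,    0,    0.78, 0,    0,    0,    0   ],
    [0,    0,    0.78, 0,    0,    0,    0,    0.78],
    [0,    0,    0,    0,    0,    0.78, 0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0.78, 0   ],
    [0,    0,    0,    0,    0.78, 0,    0,    0   ],
);
my @q = (0, 0.12, 0, 0, 0, 0, 0.2, 0);

sub norm { my $s = 0; $s += $_ ** 2 for @{ $_[0] }; return sqrt($s); }

my %sim;
my $qnorm = norm(\@q);
for my $j (0 .. $#M) {
    my $dot = 0;
    $dot += $M[$j][$_] * $q[$_] for 0 .. $#q;
    my $dnorm = norm($M[$j]);
    $sim{ "d" . ($j + 1) } = $dnorm ? $dot / ($dnorm * $qnorm) : 0;
}
printf "%s  %.3f\n", $_, $sim{$_} for sort { $sim{$b} <=> $sim{$a} } keys %sim;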


Example

Inner products: M × q

M (6×8 document-keyterm matrix):

0    0    0    0.78 0    0    0    0
0    0    0.78 0    0    0    0    0.78
0    0    0    0    0    0.78 0    0
0.48 0.48 0    0    0    0    0    0
0.48 0.48 0    0    0    0    0.78 0
0    0    0    0    0.78 0    0    0

q = (0, 0.12, 0, 0, 0, 0, 0.2, 0)

M × q = (0, 0, 0, 0.06, 0.21, 0)


Example

Norm of the query vector:

||q|| = sqrt(q · q^t) = 0.233

Norms of the document vectors (||d_1|| = 0.78, etc.):

(0.78, 1.103, 0.78, 0.679, 1.034, 0.78)

Inner products of the document-query vectors, normalized and ranked:

(0, 0, 0, 0.06, 0.21, 0)  → normalize →  (0, 0, 0, 0.379, 0.871, 0)  → rank →

rank  doc
1     d5
2     d4
· · ·


Document to document similarities

The vector space model can be used to calculate similarities between documents in the collection: instead of query-to-document vector similarity, document-to-document similarity.

Procedure:

1. Rather than operating on a query vector, operate on pairs of document vectors

2. Use the document-keyterm matrix

(a) Document vectors in our example: the row vectors
(b) Pairwise cosine similarity

3. The procedure is very expensive

(a) Can be optimized for sparse matrices
(b) Efficient in comparison to other methods

In mathematical terms: let M be the document-keyterm matrix.
Matrix of inner products: M_i = M × M^t.
Normalize by the norms of M's row vectors, etc.
We define sim(d_i, d_i) = 1.
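The same idea as a Perl sketch (assumed, not part of the slides): all pairwise cosine similarities computed from the rows of M; with the Raven weights it reproduces the normalized matrix of the next slide.

#!/usr/bin/perl
# Pairwise document-to-document cosine similarities from the document-keyterm matrix M.
use strict;
use warnings;

my @M = (
    [0,    0,    0,    0.78, 0,    0,    0,    0   ],
    [0,    0,    0.78, 0,    0,    0,    0,    0.78],
    [0,    0,    0,    0,    0,    0.78, 0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0.78, 0   ],
    [0,    0,    0,    0,    0.78, 0,    0,    0   ],
);

sub dot  { my ($x, $y) = @_; my $s = 0; $s += $x->[$_] * $y->[$_] for 0 .. $#$x; return $s; }
sub norm { return sqrt(dot($_[0], $_[0])); }

for my $i (0 .. $#M) {
    my @row = map { sprintf("%.3f", dot($M[$i], $M[$_]) / (norm($M[$i]) * norm($M[$_]))) } 0 .. $#M;
    print "@row\n";
}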


Document to document example

M_i = M × M^t =

0.608 0     0     0     0     0
0     1.217 0     0     0     0
0     0     0.608 0     0     0
0     0     0     0.46  0.46  0
0     0     0     0.46  1.069 0
0     0     0     0     0     0.608

Norms of the matrix row vectors (the denominators for the normalization):

(0.78, 1.103, 0.78, 0.679, 1.034, 0.78)

Normalized matrix values:

1     0     0     0     0     0
0     1     0     0     0     0
0     0     1     0     0     0
0     0     0     1     0.655 0
0     0     0     0.655 1     0
0     0     0     0     0     1

We find documents 4 and 5 to be similar.


Sparse matrices

1. Document-keyterm and document-document matrices are necessarily sparse

(a) Low ratio of non-zero entries to all entries
(b) Expressed as a percentage, e.g. 3% density

2. Sparseness is natural

(a) Possible matrix entries: n²
(b) Only few "relationships" can be semantically valid
(c) Large, heterogeneous document collections
(d) Many matrices are well below 1% density

3. Storage and manipulation techniques

(a) Coordinate system
(b) Compressed Column and Compressed Row formats
(c) Harwell-Boeing format
(d) These formats only store the non-zero entries:

i. dense format: n² values
ii. Harwell-Boeing: roughly n + 1 + 2·nz values


Coordinate system

0 2 5 0 0

1 0 0 3 0

0 0 0 0 2

0 1 0 0 0

3 0 0 2 0

Represented by:

row 2 5 1 4 1 2 5 3

col 1 1 2 2 3 4 4 5

values 1 3 2 1 5 3 2 2


Compressed Column format or Harwell-Boeing Format

1. Slightly more efficient than the coordinate system (about 30% savings)

2. Represents an n×m matrix by three arrays:

(a) Column pointers: m+1 values
(b) Row indexes: nz values
(c) Values (the non-zero entries): nz values

3. A value is retrieved by:

(a) Column pointer lookup
(b) Row index range lookup
(c) Value lookup


Harwell-Boeing Format

0 2 5 0 0
1 0 0 3 0
0 0 0 0 2
0 1 0 0 0
3 0 0 2 0

Represented by (1-based indexes):

colptr    1 3 5 6 8 9
rowindex  2 5 1 4 1 2 5 3
values    1 3 2 1 5 3 2 2

m(1,3) = ?

1. For this specific case:

(a) The entries of column 3 lie at positions colptr[3] ≤ p < colptr[3+1], i.e. at position 5
(b) rowindex[5] = 1, which matches row 1
(c) The corresponding value is values[5] = 5

2. In general:

(a) m(i,j) is the value at the position p with colptr[j] ≤ p < colptr[j+1] and rowindex[p] = i (zero if no such position exists)
(b) Note: colptr specifies a range of positions in rowindex
(c) colptr[j] is the index of the first value that starts column j


Perl sparse matrix example

#!/usr/bin/perl
# Compressed Column Storage lookup; indexes start at zero
@colptr   = (0, 2, 4, 5, 7, 8);
@rowindex = (1, 4, 0, 3, 0, 1, 4, 2);
@values   = (1, 3, 2, 1, 5, 3, 2, 2);

$i = $ARGV[0];   # row
$j = $ARGV[1];   # column

$r = $colptr[$j+1] - $colptr[$j];   # number of non-zero entries in column $j

$value = 0;
for ($n = $colptr[$j]; $n < $colptr[$j+1]; $n++) {
    if ($rowindex[$n] == $i) {
        $value = $values[$n];
    }
}

print "value: $value\n";


Perl sparse matrix - Compressed Row Storage (CRS)

#!/usr/bin/perl
# CRS lookup; indexes start at zero

@rowptr   = (0, 2, 4, 5, 6, 8);
@colindex = (1, 2, 0, 3, 4, 1, 0, 3);
@values   = (2, 5, 1, 3, 2, 1, 3, 2);

$i = $ARGV[0];   # row
$j = $ARGV[1];   # column

$value = 0;
for ($n = $rowptr[$i]; $n < $rowptr[$i+1]; $n++) {
    if ($colindex[$n] == $j) {
        $value = $values[$n];
    }
}

print "value: $value\n";


Perl sparse matrix - vector multiplication

#!/usr/bin/perl
# CRS matrix-vector multiplication; indexes start at zero

@vec      = (1, 2, 3, 4, 5);
@rowptr   = (0, 2, 4, 5, 6, 8);
@colindex = (1, 2, 0, 3, 4, 1, 0, 3);
@values   = (2, 5, 1, 3, 2, 1, 3, 2);

for ($i = 0; $i < 5; $i++) {
    $sum = 0;
    for ($n = $rowptr[$i]; $n < $rowptr[$i+1]; $n++) {
        # accumulate only the non-zero entries of row $i
        $sum += $values[$n] * $vec[$colindex[$n]];
    }
    print $sum . "\n";
}


Vector Space Model conclusion

1. Long history: since the 1960s

2. Very efficient

(a) Use of sparse matrix methods
(b) Simple linear algebra
(c) Well proven

3. Flexible:

(a) Use in query resolution
(b) Document to document similarity
(c) Clustering (more later)

4. For these reasons, very popular and often used

5. Disadvantages:

(a) No clearly defined theoretical framework
(b) Generation of adequate indexes
(c) Assumption of index term independence
(d) Scalability?


The use of probability theory in IR

1. IR is all about uncertainty

(a) Given a user query, will the results be relevant or not?
(b) The system may make an estimate, but certainty cannot be obtained
(c) The probability of relevance is the issue

2. Bayesian probability theory:

(a) Relates to collecting evidence and making informed guesses
(b) Evidence: was a document containing certain keyterms relevant?
(c) Evidence: was a document containing certain keyterms found not relevant?
(d) Adjust the estimate of the probability not on the basis of frequency, but on the basis of evidence


Probabilistic Models

1. Attempt to predict page relevance on probabilistic grounds

2. Rather than comparing specific page features (index terms) directly

3. Provide a clear and well-grounded framework for IR

(a) Based on statistical principles: very appropriate to IR!
(b) Relevancy odds can be updated to take factors other than text content into account
(c) User feedback can be fed into the system

4. Basic idea:

(a) A query is associated with an ideal answer set
(b) Properties of the answer set → index terms
(c) Initial guess at the answer set properties → probabilistic description
(d) Initial retrieval results
(e) User feedback can fine-tune the relevancy probabilities


Bayes' rule

http://www.cs.berkeley.edu/~murphyk/Bayes/economist.html

Bayesian probability: change your beliefs on the basis of evidence.

Example 1: assume two events a (rain) and b (clouds).

P(a|b) = P(b|a) × P(a) / P(b)

P(a|b): probability that it will rain given that we see clouds
P(b|a): how often have we seen clouds when it rains?
P(a): how often does it rain?
P(b): how often have we seen clouds?

Example 2:

a → "Person has cancer"
b → "Person is a smoker"
P(a|b) = ?

In Belgium: P(b) = 0.3.
Among those diagnosed with cancer, about 80% smoke, so P(b|a) = 0.8.
The overall probability of cancer is P(a) = 0.05.

P(a|b) = P(b|a) × P(a) / P(b) = 0.8 × 0.05 / 0.3 = 0.13

Quit now!
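The same computation as a tiny Perl check (assumed, not part of the slides):

#!/usr/bin/perl
# Bayes' rule for the smoker/cancer example: P(a|b) = P(b|a) * P(a) / P(b)
use strict;
use warnings;

my $p_b_given_a = 0.8;    # P(smoker | cancer)
my $p_a         = 0.05;   # P(cancer)
my $p_b         = 0.3;    # P(smoker)

printf "P(cancer | smoker) = %.2f\n", $p_b_given_a * $p_a / $p_b;   # prints 0.13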


Some basic notions of applying probability theory to IR

Can we do the same for IR? Let d_j be a document in the collection.
R represents the set of relevant documents.
¬R represents the set of non-relevant documents.

Then we can say things like:

P(d_j | R) = probability that, if a document is relevant, it is d_j

We want to determine P(R | d_j) from:

1. P(R)

2. P(d_j | R)

Since queries are specified in terms of index terms, we want to translate these statements into index-term-based statements, for example:

P(k_i | R) = probability that a document containing k_i is relevant.

Great! So how do we determine all these probabilities?


Probabilistic Models: Formalization

Basic probabilistic notions: w_{i,j} ∈ {0,1}, w_{i,q} ∈ {0,1}

R is the set of all relevant documents.
¬R is R's complement, i.e. the set of all non-relevant documents.

P(R | d_j) is the probability that d_j is relevant to the query q, i.e. the probability of relevance given d_j.
P(¬R | d_j) is the probability that d_j is not relevant to q.

sim(d_j, q) = P(R | d_j) / P(¬R | d_j)

Bayes' rule, P(r|e) = P(e|r) × P(r) / P(e), gives:

sim(d_j, q) = ( P(d_j | R) × P(R) ) / ( P(d_j | ¬R) × P(¬R) )

P(d_j | R): probability of selecting d_j from the set of relevant documents.
P(R): probability that a randomly selected document is relevant.

As the book mentions, P(R) and P(¬R) are constant for a given collection, so:

sim(d_j, q) ∼ P(d_j | R) / P(d_j | ¬R)


Probabilistic Models: Formalization (continued)

Moving to the keyterm level:

sim(d_j, q) ∼ [ ( ∏_{g_i(d_j)=1} P(k_i | R) ) × ( ∏_{g_i(d_j)=0} P(¬k_i | R) ) ] / [ ( ∏_{g_i(d_j)=1} P(k_i | ¬R) ) × ( ∏_{g_i(d_j)=0} P(¬k_i | ¬R) ) ]

where P(k_i | R) stands for the probability that k_i is present in a document randomly selected from R,
P(¬k_i | R) stands for the probability that k_i is not present in a document randomly selected from R,
and P(k_i | ¬R), P(¬k_i | ¬R) are defined analogously for the non-relevant documents.

Finally, the canonical probabilistic information retrieval equation:

sim(d_j, q) ∼ Σ_{i=1..t} w_{i,q} × w_{i,j} × ( log[ P(k_i | R) / (1 − P(k_i | R)) ] + log[ (1 − P(k_i | ¬R)) / P(k_i | ¬R) ] )

Does this work? Not yet: we need initial estimates of P(k_i | R) and P(k_i | ¬R).

Various schemes exist, for example:

P(k_i | R) = 0.5

P(k_i | ¬R) = n_i / N

Given w_{i,q} and w_{i,j}, these estimates allow an initial set of documents to be retrieved and ranked.
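A minimal Perl sketch of this ranking formula with the initial estimates (assumed, not from the slides), applied to binary weights for a single document; natural logarithms are used, which only rescales the scores.

#!/usr/bin/perl
# Probabilistic ranking with initial estimates P(ki|R) = 0.5 and P(ki|notR) = ni/N.
use strict;
use warnings;

my $N  = 6;                        # collection size (Raven example, assumed)
my @n  = (2, 2, 1, 1, 1, 1, 1, 1); # ni: document frequency of each keyterm
my @wq = (0, 1, 0, 0, 0, 0, 1, 0); # binary query weights ("door", "visiter")
my @wd = (1, 1, 0, 0, 0, 0, 1, 0); # binary document weights (d5)

my $sim = 0;
for my $i (0 .. $#wq) {
    my $p_r    = 0.5;              # P(ki | R)
    my $p_notr = $n[$i] / $N;      # P(ki | not R)
    $sim += $wq[$i] * $wd[$i] *
            (log($p_r / (1 - $p_r)) + log((1 - $p_notr) / $p_notr));
}
printf "sim(d5, q) = %.3f\n", $sim;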


Updating our chances

A bit of recursive reasoning:

Let V be the subset of documents initially retrieved and ranked (e.g. ranked above a threshold).
Let V_i ⊆ V be the subset of those documents that contain keyterm k_i.

We can now improve our guesses for P(k_i | R) and P(k_i | ¬R):

P(k_i | R) = V_i / V

P(k_i | ¬R) = (n_i − V_i) / (N − V)

So P(k_i | R) is updated according to how many of the retrieved documents contain k_i. Similarly, the documents that were not retrieved indicate the chances of a document containing k_i not being relevant.

User assistance can be used to update the odds, e.g. the user can indicate which documents are relevant.


Updating our chances, contd.

There is a problem when the retrieved set is small. Adjustment factors:

P(k_i | R) = (V_i + 0.5) / (V + 1)

P(k_i | ¬R) = (n_i − V_i + 0.5) / (N − V + 1)

This adjustment compensates for small retrieval sets.

A constant 0.5 factor may not always be desirable:

P(k_i | R) = (V_i + n_i/N) / (V + 1)

P(k_i | ¬R) = (n_i − V_i + n_i/N) / (N − V + 1)

so that the adjustment is relative to the overall number of documents containing k_i.
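A small Perl sketch of one re-estimation step with the 0.5 adjustment factors above (assumed, not from the slides), using illustrative counts:

#!/usr/bin/perl
# Re-estimate P(ki|R) and P(ki|notR) from the retrieved set V.
use strict;
use warnings;

my $N  = 6;   # documents in the collection
my $ni = 2;   # documents containing keyterm ki
my $V  = 2;   # documents retrieved and ranked above the threshold
my $Vi = 1;   # retrieved documents that contain ki

my $p_r    = ($Vi + 0.5) / ($V + 1);
my $p_notr = ($ni - $Vi + 0.5) / ($N - $V + 1);

printf "P(ki | R)    = %.3f\n", $p_r;      # 0.500
printf "P(ki | notR) = %.3f\n", $p_notr;   # 0.300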


Conclusion

1. Boolean Model:

(a) Binary index term weights
(b) Disjunctive Normal Form
(c) No partial matching
(d) No ranking
(e) Simple and very efficient

2. Vector space model:

(a) Index term weights:

i. Term Frequency
ii. Inverse Document Frequency

(b) Query and document vectors
(c) Cosine measure of query-document similarity

3. Probabilistic Model:

(a) Based on a probabilistic estimate of the relevance of a document given certain keyterms
(b) Allows the system to adapt to users and provides a strong formal framework for retrieval
(c) Rather complicated
(d) Performance:

i. Thought to be less efficient than the vector space model
ii. Problematic initial fine-tuning phase
iii. Probability semantics may be an issue


How do they stack up?

1. Boolean:

(a) Pro:

i. Efficient
ii. Easy to implement
iii. Preferred by users in spite of their inability to formulate adequate queries

(b) Con:

i. No ranking
ii. No partial matching
iii. Entirely dependent on keyterm definition and binary weights

2. Vector Space Model:

(a) Pro:

i. Keyterm weighting
ii. Partial matching
iii. Ranking
iv. Efficient
v. Easy-to-understand framework
vi. Flexibility

(b) Con:

i. Little formal framework
ii. Assumption of keyterm independence

3. Probabilistic Model:

(a) Compares poorly to the Vector Space Model
(b) Occam's razor may apply to IR as well
(c) Case for many novel models


Readings

Salton (1988). Term weighting approaches in automatic text retrieval.

Wildemuth (1998). Hypertext versus Boolean access to biomedical information: a comparison of effectiveness, efficiency, and user preferences.

http://www.cs.odu.edu/~jbollen/spring03_IR/readings/
