
Lecture 2: IR Models

Johan Bollen
Old Dominion University
Department of Computer Science
jbollen@cs.odu.edu
http://www.cs.odu.edu/~jbollen

January 30, 2003

Structure

1. IR formal characterization

(a) Mathematical framework

(b) Document and keyterm representations

2. Models

(a) Boolean Model

(b) Vector Space Model

(c) Probabilistic Model


Formal Characterization of IR systems

An information retrieval model is defined as the quadruple

[D, Q, F, R(q_i, d_j)]

where

1. D is a set of document representations (logical views)

2. Q is a set of representations of the user's information needs, i.e. queries

3. F is the framework for modeling document representations, queries and their relationships

4. R(q_i, d_j) is a ranking function which associates a real number with a query q_i ∈ Q and a document representation d_j ∈ D.


Formal Characterization of IR systems


Classic IR concepts

1. Concepts:

(a) Documents are represented by index terms

(b) Index term weights express how specific a term is to a particular document

(c) The same may apply to the query

2. Relevant to the Boolean, vector and probabilistic models

k_i ∈ K = {k_1, ..., k_t}: generic index term

d_j: document

w_{i,j} > 0: weight associated with the pair (k_i, d_j)

If k_i does not occur in d_j, then w_{i,j} = 0

d_j is thus associated with the index term vector d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})

The function g_i returns the weight of index term k_i in d_j, such that g_i(d_j) = w_{i,j}


Example

Edgar Allan Poe's The Conqueror Worm, 1843

Last half of the last stanza:

1) And the angels, all pallid and wan,
2) Uprising, unveiling, affirm
3) That the play is the tragedy, "Man,"
4) And its hero the Conqueror Worm.

Hand-selected keyterms:

1) affirm, 2) angel, 3) conqueror, 4) hero, 5) man, 6) pallid,
7) play, 8) tragedy, 9) unveil, 10) uprise, 11) wan, 12) worm

g_5(d_3) = 1, g_3(d_2) = 0


Matrix representation

General form (rows = documents, columns = keyterms):

w_{1,1} w_{2,1} · · · w_{t,1}
w_{1,2} w_{2,2} · · · w_{t,2}
· · ·
w_{1,n} w_{2,n} · · · w_{t,n}

For the Conqueror Worm example:

0 1 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 1 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1

d_3 = (0,0,0,0,1,0,1,1,0,0,0,0),  k_4 = (0,0,0,1)

1. Keyterm-document, or in this case document-keyterm, matrix

2. Represents the weight values between keyterms and documents

3. Usually sparse, and very large

4. Assumption of independence or orthogonality: keyterms are not related!

5. In this specific case: weight values are binary
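As an illustration (a minimal Perl sketch, not part of the original slides), a binary matrix like the one above can be generated directly from the hand-selected keyterms; a crude prefix match stands in for proper stemming, so the stem "upris" is used to catch "uprising".

#!/usr/bin/perl
# Build the binary document-keyterm matrix for the Conqueror Worm example:
# entry is 1 if the keyterm (stem) occurs in the line, 0 otherwise.
use strict;
use warnings;

my @keyterms = qw(affirm angel conqueror hero man pallid
                  play tragedy unveil upris wan worm);   # "upris" so that "uprising" matches

my @documents = (
    "and the angels all pallid and wan",
    "uprising unveiling affirm",
    "that the play is the tragedy man",
    "and its hero the conqueror worm",
);

for my $j (0 .. $#documents) {
    my @row = map { $documents[$j] =~ /\b$_/ ? 1 : 0 } @keyterms;
    print "@row\n";
}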


What are IR models?

1. Abstract models of IR techniques:

(a) Characteristics
(b) Assumptions
(c) General modus operandi
(d) Properties

2. Do not comprise full-blown IR systems

(a) Technical details will be discussed later
(b) Focus is on assumptions and characteristics

3. The present discussion is not the be-all and end-all of IR

(a) New models can be introduced
(b) The devil is often in the details
(c) Large collections require specific engineering efforts


Boolean Model

1. Basic principles:

(a) The user query is expressed as a Boolean expression
(b) Documents are predicted to be either relevant or not

2. Advantages:

(a) Neat formalism
(b) Easy and efficient implementation
(c) Widespread adoption has accustomed users to Boolean queries

3. Disadvantages:

(a) No document ranking
(b) More a data retrieval than an information retrieval model
(c) Despite users' preference for them, it is not straightforward to formulate adequate queries


Boolean Model: Formalization

w_{i,j} ∈ {0,1}

A query q consists of keyterms connected by NOT, AND and OR.

For example, q = [k_a ∧ (k_b ∨ ¬k_c)]
or: q = k_a && (k_b || !k_c)
or: "brothers" AND ("chemical" OR NOT "doobie")

For every Boolean statement there exists a Disjunctive Normal Form (DNF):

q = [k_a ∧ (k_b ∨ ¬k_c)] = (1,0,0) ∨ (1,1,0) ∨ (1,1,1)

k_a  k_b  k_c | k_a ∧ (k_b ∨ ¬k_c) | conjunctive component
 0    0    0  |        0           |
 0    0    1  |        0           |
 0    1    0  |        0           |
 0    1    1  |        0           |
 1    0    0  |        1           | k_a ∧ ¬k_b ∧ ¬k_c
 1    0    1  |        0           |
 1    1    0  |        1           | k_a ∧ k_b ∧ ¬k_c
 1    1    1  |        1           | k_a ∧ k_b ∧ k_c


DNF

[Venn diagram over K_a, K_b and K_c showing the three conjunctive components:
K_a AND NOT K_b AND NOT K_c, K_a AND K_b AND NOT K_c, K_a AND K_b AND K_c]


Boolean Model: Query matching

Let q_dnf be the DNF of q and let the q_cc be the conjunctive components of q_dnf. Then

sim(d_j, q) = 1  if ∃ q_cc ∈ q_dnf such that ∀ k_i: g_i(d_j) = g_i(q_cc)
sim(d_j, q) = 0  otherwise

In human language: the similarity between a document and a user query is one when there exists a conjunctive component of the query's DNF in which every keyterm has the same value (1 = present) as it has in the document.
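A small Perl sketch of this matching rule (assumed here, not given on the slides), using the query of the example on the next slide: a document matches when its binary pattern over the query terms equals one of the conjunctive components of the DNF.

#!/usr/bin/perl
# Boolean retrieval: "play" AND ("tragedy" OR NOT "worm") matches a document
# if its (play, tragedy, worm) pattern is one of the DNF components 100, 110, 111.
use strict;
use warnings;

my @dnf = ("100", "110", "111");

# binary weights for (play, tragedy, worm) taken from the example matrix
my %docs = (
    d1 => [0, 0, 0],
    d2 => [0, 0, 0],
    d3 => [1, 1, 0],
    d4 => [0, 0, 1],
);

for my $d (sort keys %docs) {
    my $pattern = join("", @{ $docs{$d} });
    my $sim = (grep { $_ eq $pattern } @dnf) ? 1 : 0;
    print "sim($d, q) = $sim\n";
}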


Boolean Model: Example

Query: "play" AND ("tragedy" OR NOT "worm")

translates to DNF:

("play" AND NOT "tragedy" AND NOT "worm") = (100)
OR ("play" AND "tragedy" AND NOT "worm") = (110)
OR ("play" AND "tragedy" AND "worm") = (111)

Keyterms: 1) affirm, 2) angel, 3) conqueror, 4) hero, 5) man, 6) pallid,
7) play, 8) tragedy, 9) unveil, 10) uprise, 11) wan, 12) worm

0 1 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 1 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1

d_3 = "That the play is the tragedy, 'Man,'" matches the component (110).


Boolean Model: Discussion

1. Very popular because of its simplicity and efficiency

2. Does not rank results

3. Strongly dependent on the exact use of keyterms

4. Not really IR, but rather data retrieval

(a) Keyterms are semantically "shallow" handles on files
(b) User queries are intended to match completely:

i. problems with jargon
ii. synonymy
iii. the user has to be aware of the exact keyterms
iv. humans are not very good at formulating complicated Boolean queries


Vector Space Model

1. Keyterm-based model

2. Allows partial matching and ranking of documents

3. Basic principles:

(a) Document represented by a keyterm vector
(b) t-dimensional space spanned by the set of keyterms
(c) Query represented by a keyterm vector
(d) Document-query similarity calculated from vector distance

4. Requires:

(a) Keyterm weighting for document vectors
(b) Keyterm weighting for the query
(c) A distance measure for document and query vectors

5. Features:

(a) Efficient
(b) Successful
(c) Easy-to-grasp visual representation
(d) Applicable to document-document matching


Vector Space Model

Definition:

Let

w_{i,j} be the weight associated with (k_i, d_j), w_{i,j} ∈ R+
w_{i,q} be the weight associated with (k_i, q), w_{i,q} ∈ R+

Query vector q = (w_{1,q}, w_{2,q}, ..., w_{t,q}), where t is the total number of index terms in the system.

Document d_j is represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).

Both document d_j and query q are thus represented as t-dimensional vectors.

The similarity between document and query is evaluated as the cosine of their angle:

sim(d_j, q) = (d_j · q) / (||d_j|| × ||q||)
            = Σ_{i=1..t} w_{i,j} × w_{i,q} / ( sqrt(Σ_{i=1..t} w_{i,j}²) × sqrt(Σ_{i=1..t} w_{i,q}²) )

Based on the definition of the dot product:

d_j · q = ||d_j|| ||q|| cos(α)

The cosine similarity lies in [0,1] since w_{i,j} ≥ 0 and w_{i,q} ≥ 0.
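A minimal Perl sketch of the cosine measure (assumed, not part of the slides); the two vectors below are the d_5 document vector and the query vector from the Raven example later in this lecture.

#!/usr/bin/perl
# Cosine similarity sim(d, q) = (d . q) / (||d|| * ||q||); returns 0 for zero-norm vectors.
use strict;
use warnings;

sub cosine {
    my ($d, $q) = @_;
    my ($dot, $nd, $nq) = (0, 0, 0);
    for my $i (0 .. $#$d) {
        $dot += $d->[$i] * $q->[$i];
        $nd  += $d->[$i] ** 2;
        $nq  += $q->[$i] ** 2;
    }
    return 0 unless $nd && $nq;
    return $dot / (sqrt($nd) * sqrt($nq));
}

my @d5 = (0.48, 0.48, 0, 0, 0, 0, 0.78, 0);
my @q  = (0, 0.119, 0, 0, 0, 0, 0.195, 0);
printf "sim(d5, q) = %.3f\n", cosine(\@d5, \@q);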


Cosine similarity measure

[Figure: query vector Q and document vectors d_1 and d_2 in the keyterm space spanned by k_1, k_2 and k_3, with angles α and β between the query and the document vectors.]

cos(α) = 0: perpendicular vectors; cos(α) = 1: vectors are co-linear


Mommy, where do the index term weights come from?

1. The goal is to find a weight that expresses how specific a term is to the intended set of relevant documents

2. Use of frequency

3. Two issues:

(a) term frequency (TF): frequency within a document
(b) document frequency (DF): frequency within the collection

4. Term frequency

(a) Expresses the connection between a term and a specific document
(b) A term that occurs often within a document is indicative of its meaning

5. Document frequency

(a) Expresses the general frequency of a term, e.g. "the"
(b) Characteristic of the language

6. The aim is to find keyterms which balance high TF with relatively low DF


Definition

Term frequency

Let N be the total number of documents in the collection
and n_i the number of documents in which index term k_i appears.
freq_{i,j} is the raw frequency of k_i in d_j.

The normalized term frequency is then defined as:

f_{i,j} = freq_{i,j} / max_l freq_{l,j}

where max_l freq_{l,j} is taken over all terms in document d_j.

Inverse document frequency

idf_i = log(N / n_i)

Index term weight

w_{i,j} = f_{i,j} × log(N / n_i)
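A minimal Perl sketch of these definitions (assumed, not part of the slides), using base-10 logarithms as in the worked example that follows; the raw frequencies and document frequencies below are those of the Raven example.

#!/usr/bin/perl
# TF-IDF weighting: w(i,j) = ( freq(i,j) / max_l freq(l,j) ) * log10( N / n(i) )
use strict;
use warnings;

sub log10 { return log($_[0]) / log(10); }

my %freq = (                       # raw term frequencies per document
    d4 => { chamber => 1, door => 1 },
    d5 => { chamber => 1, door => 1, visiter => 1 },
);
my %df = (chamber => 2, door => 2, visiter => 1);   # n(i): document frequencies
my $N  = 6;                                         # N: documents in the collection

for my $d (sort keys %freq) {
    my $max = 0;
    for my $f (values %{ $freq{$d} }) { $max = $f if $f > $max; }
    for my $term (sort keys %{ $freq{$d} }) {
        my $w = ($freq{$d}{$term} / $max) * log10($N / $df{$term});
        printf "w(%s, %s) = %.3f\n", $term, $d, $w;
    }
}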


Query keyterm weights

Balance DF and query frequency:

w_{i,q} = ( 0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q} ) × log(N / n_i)

where freq_{i,q} is the frequency of the term in the query (usually 1).

Document-keyterm matrix:

     k_1      k_2      · · ·  k_10
d_1  w_{1,1}  w_{2,1}  · · ·  w_{10,1}
d_2  w_{1,2}  w_{2,2}  · · ·  w_{10,2}
d_3  w_{1,3}  w_{2,3}  · · ·  w_{10,3}
d_4  w_{1,4}  w_{2,4}  · · ·  w_{10,4}


TFIDF:


Example

Edgar Allan Poe's The Raven (1845)

First stanza:

1) Once upon a midnight dreary, while I pondered, weak and weary,
2) Over many a quaint and curious volume of forgotten lore,
3) While I nodded, nearly napping, suddenly there came a tapping,
4) As of some one gently rapping, rapping at my chamber door.
5) "'Tis some visiter," I muttered, "tapping at my chamber door -
6) Only this, and nothing more."

Hand-selected keyterms with document frequency:

1) chamber-2, 2) door-2, 3) lore-1, 4) midnight-1,
5) nothing-1, 6) tap-1, 7) visiter-1, 8) volume-1

Document-keyterm matrix: a 6×8 matrix whose entries w_{i,j} ∈ R+


Example: Document-Keyterm Matrix

     chamber  door     lore  midnight  nothing  tap  visiter  volume
d_1  w_{1,1}  w_{2,1}  · · ·  w_{8,1}
d_2  w_{1,2}  w_{2,2}  · · ·  w_{8,2}
d_3  w_{1,3}  w_{2,3}  · · ·  w_{8,3}
d_4  w_{1,4}  w_{2,4}  · · ·  w_{8,4}
d_5  w_{1,5}  w_{2,5}  · · ·  w_{8,5}
d_6  w_{1,6}  w_{2,6}  · · ·  w_{8,6}

k_1 = "chamber". Its frequency is zero except for d_4 and d_5.

w_{1,4} = f_{i,j} × log(N / n_i)

or

w_{1,4} = ( freq_{i,j} / max_l freq_{l,j} ) × log(N / n_i)

With N = 6, n_1 = 2, freq_{i,j} = 1, max_l freq_{l,j} = 1:

w_{1,4} = (1/1) × log10(6/2) ≈ 0.48


Example: Document-Keyterm Matrix

Result:

     chamber  door  lore  midnight  nothing  tap   visiter  volume
d1   0        0     0     0.78      0        0     0        0
d2   0        0     0.78  0         0        0     0        0.78
d3   0        0     0     0         0        0.78  0        0
d4   0.48     0.48  0     0         0        0     0        0
d5   0.48     0.48  0     0         0        0     0.78     0
d6   0        0     0     0         0.78     0     0        0

Note: the weight values for "chamber" and "door" are lower because these terms are less specific to a single document.


Example: Query

Query: "Visitor at your door"

Translation to known keyterms: { "visiter", "door" }

Query weights:

w_{i,q} = ( 0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q} ) × log(N / n_i)

"visiter": w_{7,q} = (0.5 + 0.5 × 1/1) × log(6/1) = 0.195

"door":    w_{2,q} = (0.5 + 0.5 × 1/1) × log(6/2) = 0.119

Resulting query vector: (0, 0.119, 0, 0, 0, 0, 0.195, 0)


Retrieval

Query-document similarity:

sim(d_j, q) = (d_j · q) / (||d_j|| × ||q||)

Procedure:

1. Calculate the numerator (the inner product of q with each document vector) for all documents

(a) Weight matrix times query vector = vector of inner products
(b) Store the resulting vector

2. Calculate the denominator (the norms of q and of the document vectors)

3. Determine the similarities

(a) Normalize the vector of inner products by the vector norms
(b) Rank documents according to the values in the resulting vector
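The procedure above as a Perl sketch (assumed, not part of the slides), with the weight matrix and query vector of the example: compute the inner products, normalize by the norms, and sort.

#!/usr/bin/perl
# Rank documents by cosine similarity to the query (Raven example data).
use strict;
use warnings;

my @M = (
    [0,    0,    0,    0.78, 0,    0,    0,    0   ],
    [0,    0,    0.78, 0,    0,    0,    0,    0.78],
    [0,    0,    0,    0,    0,    0.78, 0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0.78, 0   ],
    [0,    0,    0,    0,    0.78, 0,    0,    0   ],
);
my @q = (0, 0.12, 0, 0, 0, 0, 0.2, 0);

sub norm { my $s = 0; $s += $_ ** 2 for @{ $_[0] }; return sqrt($s); }

my %sim;
my $qnorm = norm(\@q);
for my $j (0 .. $#M) {
    my $dot = 0;
    $dot += $M[$j][$_] * $q[$_] for 0 .. $#q;
    my $dnorm = norm($M[$j]);
    $sim{ "d" . ($j + 1) } = $dnorm ? $dot / ($dnorm * $qnorm) : 0;
}
printf "%s  %.3f\n", $_, $sim{$_} for sort { $sim{$b} <=> $sim{$a} } keys %sim;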


Example

Inner products: M × q

M (6×8 document-keyterm matrix):

0    0    0    0.78 0    0    0    0
0    0    0.78 0    0    0    0    0.78
0    0    0    0    0    0.78 0    0
0.48 0.48 0    0    0    0    0    0
0.48 0.48 0    0    0    0    0.78 0
0    0    0    0    0.78 0    0    0

q = (0, 0.12, 0, 0, 0, 0, 0.2, 0)

M × q = (0, 0, 0, 0.06, 0.21, 0)


Example

Norm of the query vector:

||q|| = sqrt(q · q^t) = 0.233

Norms of the document vectors (||d_1|| = 0.78, etc.):

(0.78, 1.103, 0.78, 0.679, 1.034, 0.78)

Inner products of the document-query vectors, normalized and ranked:

(0, 0, 0, 0.06, 0.21, 0)  → normalize →  (0, 0, 0, 0.379, 0.871, 0)  → rank →

rank  doc
1     d5
2     d4
· · ·


Document to document similarities

The vector space model can be used to calculate similarities between documents in the collection: instead of query-to-document vector similarity, document-to-document similarity.

Procedure:

1. Rather than operating on a query vector, operate on pairs of document vectors

2. Use the document-keyterm matrix

(a) Document vectors in our example: the row vectors
(b) Pairwise cosine similarity

3. The procedure is very expensive

(a) Can be optimized for sparse matrices
(b) Efficient in comparison to other methods

In mathematical terms: let M be the document-keyterm matrix.
Matrix of inner products: M_i = M × M^t.
Normalize by the norms of M's row vectors, etc.
We define sim(d_i, d_i) = 1.
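The same idea as a Perl sketch (assumed, not part of the slides): all pairwise cosine similarities computed from the rows of M; with the Raven weights it reproduces the normalized matrix of the next slide.

#!/usr/bin/perl
# Pairwise document-to-document cosine similarities from the document-keyterm matrix M.
use strict;
use warnings;

my @M = (
    [0,    0,    0,    0.78, 0,    0,    0,    0   ],
    [0,    0,    0.78, 0,    0,    0,    0,    0.78],
    [0,    0,    0,    0,    0,    0.78, 0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0,    0   ],
    [0.48, 0.48, 0,    0,    0,    0,    0.78, 0   ],
    [0,    0,    0,    0,    0.78, 0,    0,    0   ],
);

sub dot  { my ($x, $y) = @_; my $s = 0; $s += $x->[$_] * $y->[$_] for 0 .. $#$x; return $s; }
sub norm { return sqrt(dot($_[0], $_[0])); }

for my $i (0 .. $#M) {
    my @row = map { sprintf("%.3f", dot($M[$i], $M[$_]) / (norm($M[$i]) * norm($M[$_]))) } 0 .. $#M;
    print "@row\n";
}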


Document to document example

M_i = M × M^t =

0.608 0     0     0     0     0
0     1.217 0     0     0     0
0     0     0.608 0     0     0
0     0     0     0.46  0.46  0
0     0     0     0.46  1.069 0
0     0     0     0     0     0.608

Norms of the matrix row vectors (the denominators for the normalization):

(0.78, 1.103, 0.78, 0.679, 1.034, 0.78)

Normalized matrix values:

1     0     0     0     0     0
0     1     0     0     0     0
0     0     1     0     0     0
0     0     0     1     0.655 0
0     0     0     0.655 1     0
0     0     0     0     0     1

We find documents 4 and 5 to be similar.


Sparse matrices

1. Document-keyterm and document-document matrices are necessarily sparse

(a) Low ratio of non-zero entries to all entries
(b) Expressed as a percentage, e.g. 3% density

2. Sparseness is natural

(a) Possible matrix entries: n²
(b) Only few "relationships" can be semantically valid
(c) Large, heterogeneous document collections
(d) Many matrices are well below 1% density

3. Storage and manipulation techniques

(a) Coordinate system
(b) Compressed Column and Compressed Row formats
(c) Harwell-Boeing format
(d) These formats only store the non-zero entries:

i. dense format: n² values
ii. Harwell-Boeing: roughly n + 1 + 2·nz values


Coordinate system

0 2 5 0 0

1 0 0 3 0

0 0 0 0 2

0 1 0 0 0

3 0 0 2 0

Represented by:

row 2 5 1 4 1 2 5 3

col 1 1 2 2 3 4 4 5

values 1 3 2 1 5 3 2 2


Compressed Column format or Harwell-Boeing Format

1. Slightly more efficient than the coordinate system (about 30% savings)

2. Represents an n×m matrix by three arrays:

(a) Column pointers: m+1 values
(b) Row indexes: nz values
(c) Values (the non-zero entries): nz values

3. A value is retrieved by:

(a) Column pointer lookup
(b) Row index range lookup
(c) Value lookup


Harwell-Boeing Format

0 2 5 0 0
1 0 0 3 0
0 0 0 0 2
0 1 0 0 0
3 0 0 2 0

Represented by (1-based indexes):

colptr    1 3 5 6 8 9
rowindex  2 5 1 4 1 2 5 3
values    1 3 2 1 5 3 2 2

m(1,3) = ?

1. For this specific case:

(a) The entries of column 3 lie at positions colptr[3] ≤ p < colptr[3+1], i.e. at position 5
(b) rowindex[5] = 1, which matches row 1
(c) The corresponding value is values[5] = 5

2. In general:

(a) m(i,j) is the value at the position p with colptr[j] ≤ p < colptr[j+1] and rowindex[p] = i (zero if no such position exists)
(b) Note: colptr specifies a range of positions in rowindex
(c) colptr[j] is the index of the first value that starts column j


Perl sparse matrix example

#!/usr/bin/perl
# Compressed Column Storage lookup; indexes start at zero
@colptr   = (0, 2, 4, 5, 7, 8);
@rowindex = (1, 4, 0, 3, 0, 1, 4, 2);
@values   = (1, 3, 2, 1, 5, 3, 2, 2);

$i = $ARGV[0];   # row
$j = $ARGV[1];   # column

$r = $colptr[$j+1] - $colptr[$j];   # number of non-zero entries in column $j

$value = 0;
for ($n = $colptr[$j]; $n < $colptr[$j+1]; $n++) {
    if ($rowindex[$n] == $i) {
        $value = $values[$n];
    }
}

print "value: $value\n";


Perl sparse matrix - Compressed Row Storage (CRS)

#!/usr/bin/perl
# CRS lookup; indexes start at zero

@rowptr   = (0, 2, 4, 5, 6, 8);
@colindex = (1, 2, 0, 3, 4, 1, 0, 3);
@values   = (2, 5, 1, 3, 2, 1, 3, 2);

$i = $ARGV[0];   # row
$j = $ARGV[1];   # column

$value = 0;
for ($n = $rowptr[$i]; $n < $rowptr[$i+1]; $n++) {
    if ($colindex[$n] == $j) {
        $value = $values[$n];
    }
}

print "value: $value\n";


Perl sparse matrix - vector multiplication

#!/usr/bin/perl
# CRS matrix-vector multiplication; indexes start at zero

@vec      = (1, 2, 3, 4, 5);
@rowptr   = (0, 2, 4, 5, 6, 8);
@colindex = (1, 2, 0, 3, 4, 1, 0, 3);
@values   = (2, 5, 1, 3, 2, 1, 3, 2);

for ($i = 0; $i < 5; $i++) {
    $sum = 0;
    for ($n = $rowptr[$i]; $n < $rowptr[$i+1]; $n++) {
        # accumulate only the non-zero entries of row $i
        $sum += $values[$n] * $vec[$colindex[$n]];
    }
    print $sum . "\n";
}


Vector Space Model conclusion

1. Long history: since the 1960s

2. Very efficient

(a) Use of sparse matrix methods
(b) Simple linear algebra
(c) Well proven

3. Flexible:

(a) Use in query resolution
(b) Document to document similarity
(c) Clustering (more later)

4. For these reasons, very popular and often used

5. Disadvantages:

(a) No clearly defined theoretical framework
(b) Generation of adequate indexes
(c) Assumption of index term independence
(d) Scalability?


The use of probability theory in IR

1. IR is all about uncertainty

(a) Given a user query, will the results be relevant or not?
(b) The system may make an estimate, but certainty cannot be obtained
(c) The probability of relevance is the issue

2. Bayesian probability theory:

(a) Relates to collecting evidence and making informed guesses
(b) Evidence: was a document containing certain keyterms relevant?
(c) Evidence: was a document containing certain keyterms found not relevant?
(d) Adjust the estimate of the probability not on the basis of frequency, but on the basis of evidence


Probabilistic Models

1. Attempt to predict page relevance on probabilistic grounds

2. Rather than comparing specific page features (index terms) directly

3. Provide a clear and well-grounded framework for IR

(a) Based on statistical principles: very appropriate to IR!
(b) Relevancy odds can be updated to take factors other than text content into account
(c) User feedback can be fed into the system

4. Basic idea:

(a) A query is associated with an ideal answer set
(b) Properties of the answer set → index terms
(c) Initial guess at the answer set properties → probabilistic description
(d) Initial retrieval results
(e) User feedback can fine-tune the relevancy probabilities


Bayes' rule

http://www.cs.berkeley.edu/~murphyk/Bayes/economist.html

Bayesian probability: change your beliefs on the basis of evidence.

Example 1: assume two events a (rain) and b (clouds).

P(a|b) = P(b|a) × P(a) / P(b)

P(a|b): probability that it will rain given that we see clouds
P(b|a): how often have we seen clouds when it rains?
P(a): how often does it rain?
P(b): how often have we seen clouds?

Example 2:

a → "Person has cancer"
b → "Person is a smoker"
P(a|b) = ?

In Belgium: P(b) = 0.3.
Among those diagnosed with cancer, about 80% smoke, so P(b|a) = 0.8.
The overall probability of cancer is P(a) = 0.05.

P(a|b) = P(b|a) × P(a) / P(b) = 0.8 × 0.05 / 0.3 = 0.13

Quit now!
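The same computation as a tiny Perl check (assumed, not part of the slides):

#!/usr/bin/perl
# Bayes' rule for the smoker/cancer example: P(a|b) = P(b|a) * P(a) / P(b)
use strict;
use warnings;

my $p_b_given_a = 0.8;    # P(smoker | cancer)
my $p_a         = 0.05;   # P(cancer)
my $p_b         = 0.3;    # P(smoker)

printf "P(cancer | smoker) = %.2f\n", $p_b_given_a * $p_a / $p_b;   # prints 0.13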


Some basic notions of applying probability theory to IR

Can we do the same for IR? Let d_j be a document in the collection.
R represents the set of relevant documents.
¬R represents the set of non-relevant documents.

Then we can say things like:

P(d_j | R) = probability that, if a document is relevant, it is d_j

We want to determine P(R | d_j) from:

1. P(R)

2. P(d_j | R)

Since queries are specified in terms of index terms, we want to translate these statements into index-term-based statements, for example:

P(k_i | R) = probability that a document containing k_i is relevant.

Great! So how do we determine all these probabilities?


Probabilistic Models: Formalization

Basic probabilistic notions: w_{i,j} ∈ {0,1}, w_{i,q} ∈ {0,1}

R is the set of all relevant documents.
¬R is R's complement, i.e. the set of all non-relevant documents.

P(R | d_j) is the probability that d_j is relevant to the query q, i.e. the probability of relevance given d_j.
P(¬R | d_j) is the probability that d_j is not relevant to q.

sim(d_j, q) = P(R | d_j) / P(¬R | d_j)

Bayes' rule, P(r|e) = P(e|r) × P(r) / P(e), gives:

sim(d_j, q) = ( P(d_j | R) × P(R) ) / ( P(d_j | ¬R) × P(¬R) )

P(d_j | R): probability of selecting d_j from the set of relevant documents.
P(R): probability that a randomly selected document is relevant.

As the book mentions, P(R) and P(¬R) are constant for a given collection, so:

sim(d_j, q) ∼ P(d_j | R) / P(d_j | ¬R)


Probabilistic Models: Formalization (continued)

Moving to the keyterm level:

sim(d_j, q) ∼ [ ( ∏_{g_i(d_j)=1} P(k_i | R) ) × ( ∏_{g_i(d_j)=0} P(¬k_i | R) ) ] / [ ( ∏_{g_i(d_j)=1} P(k_i | ¬R) ) × ( ∏_{g_i(d_j)=0} P(¬k_i | ¬R) ) ]

where P(k_i | R) stands for the probability that k_i is present in a document randomly selected from R,
P(¬k_i | R) stands for the probability that k_i is not present in a document randomly selected from R,
and P(k_i | ¬R), P(¬k_i | ¬R) are defined analogously for the non-relevant documents.

Finally, the canonical probabilistic information retrieval equation:

sim(d_j, q) ∼ Σ_{i=1..t} w_{i,q} × w_{i,j} × ( log[ P(k_i | R) / (1 − P(k_i | R)) ] + log[ (1 − P(k_i | ¬R)) / P(k_i | ¬R) ] )

Does this work? Not yet: we need initial estimates of P(k_i | R) and P(k_i | ¬R).

Various schemes exist, for example:

P(k_i | R) = 0.5

P(k_i | ¬R) = n_i / N

Given w_{i,q} and w_{i,j}, these estimates allow an initial set of documents to be retrieved and ranked.
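A minimal Perl sketch of this ranking formula with the initial estimates (assumed, not from the slides), applied to binary weights for a single document; natural logarithms are used, which only rescales the scores.

#!/usr/bin/perl
# Probabilistic ranking with initial estimates P(ki|R) = 0.5 and P(ki|notR) = ni/N.
use strict;
use warnings;

my $N  = 6;                        # collection size (Raven example, assumed)
my @n  = (2, 2, 1, 1, 1, 1, 1, 1); # ni: document frequency of each keyterm
my @wq = (0, 1, 0, 0, 0, 0, 1, 0); # binary query weights ("door", "visiter")
my @wd = (1, 1, 0, 0, 0, 0, 1, 0); # binary document weights (d5)

my $sim = 0;
for my $i (0 .. $#wq) {
    my $p_r    = 0.5;              # P(ki | R)
    my $p_notr = $n[$i] / $N;      # P(ki | not R)
    $sim += $wq[$i] * $wd[$i] *
            (log($p_r / (1 - $p_r)) + log((1 - $p_notr) / $p_notr));
}
printf "sim(d5, q) = %.3f\n", $sim;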


Updating our chances

A bit of recursive reasoning:

Let V be the subset of documents initially retrieved and ranked (e.g. ranked above a threshold).
Let V_i ⊆ V be the subset of those documents that contain keyterm k_i.

We can now improve our guesses for P(k_i | R) and P(k_i | ¬R):

P(k_i | R) = V_i / V

P(k_i | ¬R) = (n_i − V_i) / (N − V)

So P(k_i | R) is updated according to how many of the retrieved documents contain k_i. Similarly, the documents that were not retrieved indicate the chances of a document containing k_i not being relevant.

User assistance can be used to update the odds, e.g. the user can indicate which documents are relevant.


Updating our chances, contd.

There is a problem when the retrieved set is small. Adjustment factors:

P(k_i | R) = (V_i + 0.5) / (V + 1)

P(k_i | ¬R) = (n_i − V_i + 0.5) / (N − V + 1)

This adjustment compensates for small retrieval sets.

A constant 0.5 factor may not always be desirable:

P(k_i | R) = (V_i + n_i/N) / (V + 1)

P(k_i | ¬R) = (n_i − V_i + n_i/N) / (N − V + 1)

so that the adjustment is relative to the overall number of documents containing k_i.
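A small Perl sketch of one re-estimation step with the 0.5 adjustment factors above (assumed, not from the slides), using illustrative counts:

#!/usr/bin/perl
# Re-estimate P(ki|R) and P(ki|notR) from the retrieved set V.
use strict;
use warnings;

my $N  = 6;   # documents in the collection
my $ni = 2;   # documents containing keyterm ki
my $V  = 2;   # documents retrieved and ranked above the threshold
my $Vi = 1;   # retrieved documents that contain ki

my $p_r    = ($Vi + 0.5) / ($V + 1);
my $p_notr = ($ni - $Vi + 0.5) / ($N - $V + 1);

printf "P(ki | R)    = %.3f\n", $p_r;      # 0.500
printf "P(ki | notR) = %.3f\n", $p_notr;   # 0.300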


Conclusion

1. Boolean Model:

(a) Binary index term weights
(b) Disjunctive Normal Form
(c) No partial matching
(d) No ranking
(e) Simple and very efficient

2. Vector space model:

(a) Index term weights:

i. Term Frequency
ii. Inverse Document Frequency

(b) Query and document vectors
(c) Cosine measure of query-document similarity

3. Probabilistic Model:

(a) Based on a probabilistic estimate of the relevance of a document given certain keyterms
(b) Allows the system to adapt to users and provides a strong formal framework for retrieval
(c) Rather complicated
(d) Performance:

i. Thought to be less efficient than the vector space model
ii. Problematic initial fine-tuning phase
iii. Probability semantics may be an issue


How do they stack up?

1. Boolean:

(a) Pro:

i. Efficient
ii. Easy to implement
iii. Preferred by users in spite of their inability to formulate adequate queries

(b) Con:

i. No ranking
ii. No partial matching
iii. Entirely dependent on keyterm definition and binary weights

2. Vector Space Model:

(a) Pro:

i. Keyterm weighting
ii. Partial matching
iii. Ranking
iv. Efficient
v. Easy-to-understand framework
vi. Flexibility

(b) Con:

i. Little formal framework
ii. Assumption of keyterm independence

3. Probabilistic Model:

(a) Compares poorly to the Vector Space Model
(b) Occam's razor may apply to IR as well
(c) Case for many novel models


Readings

Salton (1988). Term weighting approaches in automatic text retrieval.

Wildemuth (1998). Hypertext versus Boolean access to biomedical information: a comparison of effectiveness, efficiency, and user preferences.

http://www.cs.odu.edu/~jbollen/spring03_IR/readings/
