LANGUAGE MODELS FOR RELEVANCE FEEDBACK 2003. 07. 14 Lee Won Hee.

Page 1

LANGUAGE MODELS FOR RELEVANCE FEEDBACK

2003. 07. 14

Lee Won Hee

Page 2

Abstract

The language modeling approach to IR
- Query : a random event
- Documents : ranked according to the likelihood that users have a prototypical document in mind and will choose query terms accordingly
- Inferences about the semantic content of documents do not need to be made, resulting in a conceptually simple model

Page 3

1. Introduction

The language modeling approach to IR
- Developed by Ponte and Croft, 1998
- Query : a random event generated according to a probability distribution
- Document similarity:
  - estimating a model of the term generation probabilities for the query terms for each document
  - ranking the documents according to the probability of generating the query

The main advantages of the language modeling approach
- Document boundaries are not predefined
  - uses the document-level statistics of tf and idf
- Uncertainty is modeled by probabilities
  - handles noisy data such as OCR text and automatically recognized speech transcripts
- supports relevance feedback and document routing

Page 4

2. The Language Modeling Approach to IR

The query generation probability
- The probability will be estimated starting with the maximum likelihood estimate of the probability of term t in document d
- tf(t,d) : the raw term frequency of term t in document d
- dl_d : the total number of tokens in document d

$$\hat{P}_{ml}(t \mid M_d) = \frac{tf(t,d)}{dl_d}$$
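As a quick illustration, the maximum likelihood estimate above is just a term count over the document length. A minimal Python sketch; the toy document is hypothetical, not from the paper.

```python
from collections import Counter

def p_ml(term, doc_tokens):
    """Maximum likelihood estimate: tf(t,d) / dl_d."""
    # Counter returns 0 for unseen terms, so missing query terms get probability 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

# hypothetical toy document (not from the paper)
doc = ["language", "models", "for", "ir", "language"]
```

Note that the estimates over a document's vocabulary sum to 1, which is what makes this a proper generation model.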

Page 5

2.1 Insufficient Data

Two problems with the maximum likelihood estimator
- We do not wish to assign a probability of zero to a document that is missing one or more of the query terms
  - If a user included several synonyms in the query, a document missing even one of them would not be retrieved
  - A more reasonable fallback is the collection distribution:

$$\frac{cf_t}{cs}$$

  - cf_t : the raw count of term t in the collection
  - cs : the raw collection size, or the total number of tokens in the collection
- We only have a document-sized sample from that distribution, so the variation in the raw counts may partially be accounted for by randomness

Page 6

2.2 Averaging

The mean probability estimate of t in documents containing it
- circumvents the problem of insufficient data
- some risk : if the mean were used by itself, there would be no distinction between documents with different term frequencies

Combining the two estimates using the geometric distribution
- Ghosh et al., 1983
- robustness of estimation; minimizes the risk

$$\hat{P}_{avg}(t) = \frac{\sum_{d \,:\, t \in d} \hat{P}_{ml}(t \mid M_d)}{df_t}$$

$$\hat{R}_{t,d} = \left(\frac{1.0}{1.0 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1.0 + \bar{f}_t}\right)^{tf(t,d)}$$

- df_t : the document frequency of t
- \bar{f}_t : the mean term frequency of term t in documents containing t
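The averaging estimate and the geometric risk function can be sketched directly from their definitions. A minimal sketch in Python; the toy collection is hypothetical, not from the paper.

```python
from collections import Counter

# hypothetical toy collection (not from the paper)
docs = [["a", "b", "a"], ["a", "c"], ["b", "c"]]

def p_avg(term, collection):
    """Mean ML probability of the term over the documents containing it (denominator df_t)."""
    containing = [d for d in collection if term in d]
    return sum(Counter(d)[term] / len(d) for d in containing) / len(containing)

def risk(term, doc, collection):
    """Geometric risk R_{t,d}: shrinks toward 0 as tf(t,d) grows past the mean frequency."""
    containing = [d for d in collection if term in d]
    f_bar = sum(Counter(d)[term] for d in containing) / len(containing)  # mean tf of t
    tf = Counter(doc)[term]
    return (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf
```

The risk falls as tf(t,d) rises, so documents with strong term-frequency evidence lean on the ML estimate rather than the collection-wide mean.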

Page 7

2.3 Combining the Two Estimates

The estimate of the probability of producing the query for a given document model

- first term : the probability of producing the terms in the query

- second term : the probability of not producing other terms

- better discriminators of the document

$$\hat{P}(Q \mid M_d) = \prod_{t \in Q} \hat{P}(t \mid M_d) \times \prod_{t \notin Q} \left(1.0 - \hat{P}(t \mid M_d)\right)$$

where

$$\hat{P}(t \mid M_d) = \begin{cases} \hat{P}_{ml}(t,d)^{(1.0 - \hat{R}_{t,d})} \times \hat{P}_{avg}(t)^{\hat{R}_{t,d}} & \text{if } tf(t,d) > 0 \\[4pt] \dfrac{cf_t}{cs} & \text{otherwise} \end{cases}$$
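The combined estimate can be sketched end to end in Python. This is a simplified sketch: the product over non-query terms (1 - P(t|M_d)) is omitted for brevity, and the collection is a hypothetical toy.

```python
from collections import Counter
from math import prod

# hypothetical toy collection (not from the paper)
docs = [["a", "b"], ["b", "c"]]

def score(query, doc, collection):
    """Query likelihood P(Q|M_d) with the combined estimate; the factor over
    non-query terms (1 - P(t|M_d)) is omitted here for brevity."""
    cs = sum(len(d) for d in collection)  # total tokens in the collection
    def p_hat(t):
        tf = Counter(doc)[t]
        if tf == 0:
            cf = sum(Counter(d)[t] for d in collection)  # collection frequency of t
            return cf / cs  # back off to the collection model
        containing = [d for d in collection if t in d]
        p_ml = tf / len(doc)
        p_avg = sum(Counter(d)[t] / len(d) for d in containing) / len(containing)
        f_bar = sum(Counter(d)[t] for d in containing) / len(containing)
        r = (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf
        return p_ml ** (1.0 - r) * p_avg ** r  # geometric mix of the two estimates
    return prod(p_hat(t) for t in query)
```

A document containing a query term scores higher than one relying on the cf_t/cs back-off, which is the non-zero-probability behavior the slides motivate.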

Page 8

3. Related Work

1. The Harper and van Rijsbergen model

2. The Rocchio method

3. The INQUERY model

4. Exponential models

Page 9

3.1 The Harper and van Rijsbergen Model (1978)

The goal is to obtain better estimates for the probability of relevance of a document given the query

An approximation of the dependence of query terms was defined by the authors by means of a maximal spanning tree
- each node of the tree : a single query term
- the edges between nodes : weighted by a measure of term dependency

A tree that spanned all of the nodes and maximized the expected mutual information:

$$\sum_{i,j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i) P(x_j)}$$

- P(xi, xj) : the probability of term xi and term xj occurring together in a relevant document
- P(xi) : the probability of term xi occurring in a relevant document
- P(xj) : the probability of term xj occurring in a relevant document
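One pairwise edge weight of that measure can be sketched from document-level occurrence statistics. A minimal sketch; the judged-relevant documents are hypothetical, and only the joint-occurrence term shown on the slide is computed (the full EMIM sums over all co-occurrence events).

```python
from math import log

# hypothetical term sets of judged-relevant documents (not from the paper)
rel_docs = [{"ir", "model"}, {"ir", "model"}, {"ir"}, {"query"}]

def emim(xi, xj, documents):
    """Pairwise term of the expected mutual information measure:
    P(xi,xj) * log(P(xi,xj) / (P(xi) * P(xj))); zero when the pair never co-occurs."""
    n = len(documents)
    p_i = sum(xi in d for d in documents) / n
    p_j = sum(xj in d for d in documents) / n
    p_ij = sum(xi in d and xj in d for d in documents) / n
    if p_ij == 0.0:
        return 0.0
    return p_ij * log(p_ij / (p_i * p_j))
```

These pairwise weights would then feed a standard maximum spanning tree algorithm over the query-term nodes.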

Page 10

3.2 The Rocchio Method (1971)

The Rocchio method provides a mechanism for the selection and weighting of expansion terms
- can be used to rank the terms in the judged documents
- The top N can then be added to the query and weighted
- a reasonable solution to the problem of relevance feedback that works very well in practice

Empirically determine the optimal values of α, β, and γ

$$w_{new}(t) = \alpha\, w(t) + \beta\, \frac{1.0}{|R|} \sum_{r \in R} w_r(t) - \gamma\, \frac{1.0}{|\bar{R}|} \sum_{r \in \bar{R}} w_r(t)$$

- α : the weight assigned to the original query term
- β : the weight assigned for occurring in relevant documents
- γ : the weight assigned for occurring in non-relevant documents
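The Rocchio update reduces to a weighted sum over per-document term-weight vectors. A minimal sketch, assuming dict-based term-weight vectors (e.g. tf.idf); the default α, β, γ values are illustrative, since the slides note they are determined empirically.

```python
def rocchio(query_w, rel_w, nonrel_w, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio term weighting: original query weight, plus the mean weight in
    relevant docs, minus the mean weight in non-relevant docs.
    Each weight vector is a dict mapping term -> weight."""
    terms = set(query_w) | {t for d in rel_w for t in d} | {t for d in nonrel_w for t in d}
    def mean(t, ds):
        # average weight of term t over a set of judged documents
        return sum(d.get(t, 0.0) for d in ds) / len(ds) if ds else 0.0
    return {t: alpha * query_w.get(t, 0.0) + beta * mean(t, rel_w) - gamma * mean(t, nonrel_w)
            for t in terms}
```

Ranking the resulting weights and keeping the top N gives the expansion terms the slide describes.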

Page 11

3.3 The INQUERY Model (1/2)

INQUERY inference network (Turtle, 1991)
- document portion : computed in advance
- query portion : computed at retrieval time

Document Network
- document nodes : d1...di
- text nodes : t1...tj
- concept representation nodes : r1...rk

Query Network
- query concepts : c1...cm
- queries : q1, q2
- information need : I

Uncertainty due to differences in word sense

Figure 3.1 Example inference network

Page 12

3.3 The INQUERY Model (2/2)

Relevance Feedback
- Implementation of the theoretical relevance feedback was done by Haines (1996)

Annotated query network
- proposition nodes : k1, k2
- observed relevance judgment nodes : j1, j2
- "and" nodes : require an annotation in order to have an effect on the score

The drawback of this technique
- It requires inferences of considerable complexity
- Relevance judgment : two additional layers of inference and several new propositions are required

Figure 3.3 Annotated query network

Page 13

3.4 Exponential Models

An approach to predicting topic shifts in text using exponential models (Beeferman et al., 1997)
- The model utilized ratios of long-range language models and short-range language models to predict useful terms

Topic shift
- when a long-range language model is not able to predict the next word better than a short-range language model

$$L = \log \frac{P_l(x)}{P_s(x)}$$

- Pl(x) : the probability of seeing word x given the context of the last 500 words

- Ps(x) : the probability of seeing word x given the two previous words
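Assuming the two model probabilities are available, the signal is just their log ratio. A minimal sketch; the zero threshold is a hypothetical stand-in for whatever decision rule Beeferman et al. fit from data.

```python
from math import log

def topic_shift_signal(p_long, p_short):
    """Log ratio of long-range vs short-range model probability for the next word;
    negative values mean the short-range model predicts better."""
    return log(p_long / p_short)

def is_topic_shift(p_long, p_short, threshold=0.0):
    # hypothetical threshold; a fitted model would learn this boundary
    return topic_shift_signal(p_long, p_short) < threshold
```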

Page 14

4. Query Expansion in the Language Modeling Approach

Assumption of this approach
- Users can choose query terms that are likely to occur in documents in which they would be interested

This approach has been developed into a ranking formula by means of probabilistic language models

Page 15

4.1 Interactive Retrieval with Relevance Feedback

Relevance Feedback
- A small number of documents are judged relevant by the user
- The relevance of all the remaining documents is unknown to the system

Page 16

4.2 Document Routing

Document Routing
- The task is to choose terms associated with documents of interest and to avoid those associated with other documents
- A training collection is available with a large number of relevance judgments, both positive and negative, for a particular query

Ratio Method
- Can utilize additional information by estimating probabilities for both sets

Page 17

4.3 The Ratio Method

Ratio Method
- predicts useful terms
- Terms can be ranked according to their probability of occurrence under the relevant document models
- Terms are ranked according to this ratio and the top N are added to the initial query

- R : the set of relevant documents

- P(t|Md) : the probability of term t given the document model for d

- cft : the raw count of term t in the collection

- cs : the raw collection size

$$L_t = \sum_{d \in R} \log \frac{P(t \mid M_d)}{cf_t / cs}$$
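The ratio method scores each candidate term by how much more probable it is under the relevant document models than under the collection model. A simplified sketch: P(t|M_d) is taken as the ML estimate, and terms absent from a relevant document are skipped rather than backed off; the collection is hypothetical.

```python
from collections import Counter
from math import log

# hypothetical collection; rel holds the judged-relevant subset (not from the paper)
collection = [["a", "b"], ["b", "b"], ["c", "b"]]
rel = [["a", "b"]]

def ratio_scores(rel_docs, all_docs):
    """L_t = sum over judged-relevant docs of log(P(t|M_d) / (cf_t/cs)),
    with P(t|M_d) as the ML estimate (a simplification of the full method)."""
    cs = sum(len(d) for d in all_docs)
    cf = Counter(t for d in all_docs for t in d)  # collection frequencies
    scores = {}
    for d in rel_docs:
        counts = Counter(d)
        for t in counts:
            p_ml = counts[t] / len(d)
            scores[t] = scores.get(t, 0.0) + log(p_ml / (cf[t] / cs))
    return scores
```

Sorting the scores and taking the top N yields the expansion terms added to the initial query.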

Page 18

4.4 Evaluation

Results are measured using recall and precision

- R : Relevant Set
- \bar{R} : Non-Relevant Set
- r : Retrieved Set
- \bar{r} : Non-Retrieved Set

$$Recall = \frac{|R \cap r|}{|R|} = p(r \mid R)$$

$$Precision = \frac{|R \cap r|}{|r|} = p(R \mid r)$$

$$Fallout = \frac{|\bar{R} \cap r|}{|\bar{R}|} = p(r \mid \bar{R})$$
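These three measures reduce to simple set arithmetic over document ids. A minimal sketch; the ids below are hypothetical.

```python
def recall(relevant, retrieved):
    """|R ∩ r| / |R| = p(r|R)."""
    return len(relevant & retrieved) / len(relevant)

def precision(relevant, retrieved):
    """|R ∩ r| / |r| = p(R|r)."""
    return len(relevant & retrieved) / len(retrieved)

def fallout(relevant, retrieved, collection):
    """|R̄ ∩ r| / |R̄| = p(r|R̄)."""
    non_relevant = collection - relevant
    return len(non_relevant & retrieved) / len(non_relevant)

# hypothetical document ids (not from the paper)
collection = set(range(10))
relevant = {0, 1, 2, 3}
retrieved = {2, 3, 4}
```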

Page 19

4.5 Experiments (1/2)

Comparison of the Rocchio method vs. the language modeling approach
- Language Model : log ratio of the probability in the judged relevant set
- Rocchio : the weighting function was tf.idf, with no negative feedback (γ = 0)
- The language modeling approach works well

Page 20

4.5 Experiments (2/2)

Page 21

4.6 Information Routing: Ratio Methods with More Data

Ratio 1

Ratio 2

- The log ratio of the average probability in judged relevant documents vs. the average probability in judged non-relevant documents

Result
- The language modeling approach is a good model for retrieval

Ratio 1:

$$L_t = \sum_{d \in R} \log \frac{P(t \mid M_d)}{cf_t / cs}$$

Ratio 2:

$$L_t = \log \frac{avg(P(t \mid M_d),\, d \in R)}{avg(P(t \mid M_d),\, d \in \bar{R})}, \quad avg(P(t \mid M_d),\, d \in R) = \frac{\sum_{d \in R} P(t \mid M_d)}{|R|}$$

and similarly for $\bar{R}$.
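Ratio 2 compares average term probabilities in the two judged sets. A minimal sketch using ML estimates for P(t|M_d); it assumes both averages are positive (in practice a back-off such as cf_t/cs would be needed), and the judged documents are hypothetical.

```python
from collections import Counter
from math import log

# hypothetical judged documents (not from the paper)
rel = [["a", "a", "b"], ["a", "c"]]
nonrel = [["a", "b", "b", "b"]]

def ratio2(term, rel_docs, nonrel_docs):
    """Ratio 2: log of the average term probability in judged relevant docs
    over the average in judged non-relevant docs (ML estimates; assumes both > 0)."""
    avg_rel = sum(Counter(d)[term] / len(d) for d in rel_docs) / len(rel_docs)
    avg_non = sum(Counter(d)[term] / len(d) for d in nonrel_docs) / len(nonrel_docs)
    return log(avg_rel / avg_non)
```

A positive score marks a term that is relatively more prominent in the relevant set, which is exactly what a routing profile wants to boost.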

Page 22

5. Query Term Weighting

Probability estimation
- Maximum likelihood probability
- The average probability (combined with a geometric risk function)

Risk function
- The current risk function treats all terms equally
- The change will be to mix the estimation
  - useless terms, stop words : the term is assigned an equal probability estimate for every document (so that it has no effect on the ranking)

User-specified language models
- Queries : a specific type of text produced by the user
- The term weights : equivalent to the generation probabilities of the query model