Query Operations

J. H. Wang, Mar. 26, 2008

The Retrieval Process

[Figure: block diagram of the retrieval process. Modules: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module. Data: Text, Text Database, Index (inverted file). Flows: user need, logical view, user feedback, query, retrieved docs, ranked docs.]

Query Modification

• Improving initial query formulation
– Relevance feedback
  • approaches based on feedback information from users
– Local analysis
  • approaches based on information derived from the set of documents initially retrieved (called the local set of documents)
– Global analysis
  • approaches based on global information derived from the document collection

Relevance Feedback

• Relevance feedback process
– shields the user from the details of the query reformulation process
– breaks down the whole searching task into a sequence of small steps which are easier to grasp
– provides a controlled process designed to emphasize some terms and de-emphasize others
• Two basic techniques
– Query expansion
  • addition of new terms from relevant documents
– Term reweighting
  • modification of term weights based on the user relevance judgement

Vector Space Model

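• Definition
– $w_{i,j}$: the ith term (weight) in the vector for document $d_j$
– $w_{i,k}$: the ith term (weight) in the vector for query $q_k$
– $t$: the number of unique terms in the data set

$$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}), \qquad \vec{q}_k = (w_{1,k}, w_{2,k}, \ldots, w_{t,k})$$

$$similarity(d_j, q_k) = \sum_{i=1}^{t} w_{i,j} \times w_{i,k}$$

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_l \{tf_{l,j}\}}\right) idf_i}{\sqrt{\sum_{k=1}^{t} \left(0.5 + 0.5\,\frac{tf_{k,j}}{\max_l \{tf_{l,j}\}}\right)^2 idf_k^2}}$$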
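As an illustration of the vector space model, the following is a minimal sketch of augmented tf-idf weighting with length normalization, followed by the inner-product similarity. The function names and the toy frequencies are assumptions made for this example, not part of the slides. The weighting is applied literally, so a term with zero frequency still receives 0.5 × idf before normalization; an implementation may prefer to leave absent terms at zero.

```python
import math

def augmented_tfidf(tf, idf):
    """Normalized weights for one document (hypothetical helper).

    tf[i] is the raw frequency of term i in the document,
    idf[i] is the inverse document frequency of term i.
    """
    max_tf = max(tf)
    if max_tf == 0:
        return [0.0] * len(tf)
    # Numerator of the weight: (0.5 + 0.5 * tf / max tf) * idf
    raw = [(0.5 + 0.5 * f / max_tf) * w for f, w in zip(tf, idf)]
    # Denominator: Euclidean length of the raw weight vector
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

def similarity(d, q):
    """Inner product of a document vector and a query vector."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Toy data (made up): three terms, all with idf = 1.0
doc = augmented_tfidf([2, 1, 0], [1.0, 1.0, 1.0])
query = augmented_tfidf([1, 0, 0], [1.0, 1.0, 1.0])
score = similarity(doc, query)
```

Because both vectors are normalized to unit length, the inner product here equals the cosine of the angle between them.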

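Query Expansion and Term Reweighting for the Vector Model

• Ideal situation
– $C_R$: set of relevant documents among all $N$ documents in the collection

$$\vec{q}_{opt} = \frac{1}{|C_R|} \sum_{\vec{d}_j \in C_R} \vec{d}_j \;-\; \frac{1}{N - |C_R|} \sum_{\vec{d}_j \notin C_R} \vec{d}_j$$

• Rocchio (1965, 1971)
– $R$: set of relevant documents, as identified by the user among the retrieved documents
– $S$: set of non-relevant documents among the retrieved documents

$$\vec{q}_m = \alpha\,\vec{q} + \frac{\beta}{|R|} \sum_{\vec{d}_j \in R} \vec{d}_j - \frac{\gamma}{|S|} \sum_{\vec{d}_j \in S} \vec{d}_j$$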

Rocchio’s Algorithm

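• Ide_Regular (1971)

$$\vec{q}_m = \alpha\,\vec{q} + \beta \sum_{\vec{d}_j \in R} \vec{d}_j - \gamma \sum_{\vec{d}_j \in S} \vec{d}_j$$

• Ide_Dec_Hi

$$\vec{q}_m = \alpha\,\vec{q} + \beta \sum_{\vec{d}_j \in R} \vec{d}_j - \gamma\, Max\{\vec{d}_j \mid \vec{d}_j \in S\}$$

– $Max\{\vec{d}_j \mid \vec{d}_j \in S\}$: the highest ranked non-relevant document

• Parameters: $\alpha = \beta = \gamma = 1$; $\beta > \gamma$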
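The three reformulation rules (Rocchio, Ide_Regular, Ide_Dec_Hi) differ only in how the relevant and non-relevant document vectors are combined. The sketch below makes that concrete under some assumptions: vectors are plain Python lists of equal length, the function names are invented for this example, all three parameters default to 1, and the non-relevant list passed to `ide_dec_hi` is assumed to be ordered by rank so that its first element is the highest ranked non-relevant document.

```python
def _scaled_add(a, b, scale):
    """Return a + scale * b, component by component."""
    return [x + scale * y for x, y in zip(a, b)]

def _vector_sum(vectors, dim):
    total = [0.0] * dim
    for v in vectors:
        total = _scaled_add(total, v, 1.0)
    return total

def rocchio(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio: sums are normalized by the sizes of R and S."""
    out = [alpha * x for x in q]
    if rel:
        out = _scaled_add(out, _vector_sum(rel, len(q)), beta / len(rel))
    if nonrel:
        out = _scaled_add(out, _vector_sum(nonrel, len(q)), -gamma / len(nonrel))
    return out

def ide_regular(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide_Regular: same as Rocchio but without the size normalization."""
    out = [alpha * x for x in q]
    out = _scaled_add(out, _vector_sum(rel, len(q)), beta)
    return _scaled_add(out, _vector_sum(nonrel, len(q)), -gamma)

def ide_dec_hi(q, rel, nonrel_ranked, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide_Dec_Hi: subtract only the highest ranked non-relevant document."""
    out = [alpha * x for x in q]
    out = _scaled_add(out, _vector_sum(rel, len(q)), beta)
    if nonrel_ranked:
        out = _scaled_add(out, nonrel_ranked[0], -gamma)
    return out

# Toy vectors over three terms (made-up values)
q = [1.0, 0.0, 0.0]
rel = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
nonrel = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
q_rocchio = rocchio(q, rel, nonrel)
```

Negative components can appear in the modified query, as in the third term of this toy case; whether to clip them at zero is a design choice this sketch leaves open.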

Probabilistic Model

• Definition
– $p_i$: the probability of observing term $t_i$ in the set of relevant documents
– $q_i$: the probability of observing term $t_i$ in the set of nonrelevant documents

$$sim(d_j, q) \approx \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$$

• Initial search assumption
– $p_i$ is constant for all terms $t_i$ (typically 0.5)
– $q_i$ can be approximated by the distribution of $t_i$ in the whole collection

$$w_t = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \log \frac{N - df_i}{df_i} \approx \log \frac{N}{df_i} = idf_i$$

Term Reweighting for the Probabilistic Model

• Robertson and Sparck Jones (1976)
• With relevance feedback from user
– $N$: the number of documents in the collection
– $R$: the number of relevant documents for query $q$
– $n_i$: the number of documents having term $t_i$
– $r_i$: the number of relevant documents having term $t_i$

Document relevance (columns) against document indexing (rows):

                      Relevant (+)    Non-relevant (-)         Total
Term present (+)      r_i             n_i - r_i                n_i
Term absent (-)       R - r_i         N - n_i - R + r_i        N - n_i
Total                 R               N - R                    N

• Initial search assumption
– $p_i$ is constant for all terms $t_i$ (typically 0.5)
– $q_i$ can be approximated by the distribution of $t_i$ in the whole collection

$$sim(d_j, q) \approx \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log \frac{N - n_i}{n_i}$$

• With relevance feedback from users
– $p_i$ and $q_i$ can be approximated by

$$p_i = \frac{r_i}{R}, \qquad q_i = \frac{n_i - r_i}{N - R}$$

– hence the term weight is updated by

$$sim(d_j, q) \approx \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log \frac{r_i\,(N - n_i - R + r_i)}{(R - r_i)(n_i - r_i)}$$

Term Reweighting for the Probabilistic Model (cont.)

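• However, the last formula poses problems for certain small values of $R$ and $r_i$ (e.g., $R = 1$, $r_i = 0$, for which $p_i = 0$ and the logarithm is undefined)
• An adjustment factor of 0.5 is therefore added

$$p_i = \frac{r_i + 0.5}{R + 1}, \qquad q_i = \frac{n_i - r_i + 0.5}{N - R + 1}$$

• Instead of 0.5, alternative adjustments have been proposed

$$p_i = \frac{r_i + \frac{n_i}{N}}{R + 1}, \qquad q_i = \frac{n_i - r_i + \frac{n_i}{N}}{N - R + 1}$$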
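To see why the adjustment matters, the sketch below computes the relevance-feedback term weight log[p_i(1 - q_i) / (q_i(1 - p_i))] both with the raw estimates p_i = r_i/R, q_i = (n_i - r_i)/(N - R) and with an additive adjustment. The function names and the sample counts are assumptions made for this example.

```python
import math

def term_weight_unadjusted(N, R, n_i, r_i):
    """Weight from the raw estimates; breaks down when p or q is 0 or 1."""
    p = r_i / R
    q = (n_i - r_i) / (N - R)
    return math.log(p * (1 - q) / (q * (1 - p)))

def term_weight_adjusted(N, R, n_i, r_i, adjust=0.5):
    """Weight from the adjusted estimates.

    adjust=0.5 gives the 0.5 variant; passing adjust=n_i / N gives
    the alternative adjustment.
    """
    p = (r_i + adjust) / (R + 1)
    q = (n_i - r_i + adjust) / (N - R + 1)
    return math.log(p * (1 - q) / (q * (1 - p)))

# Made-up counts: 100 documents, 1 judged relevant, term in 10 documents,
# and the single relevant document does not contain the term.
N, R, n_i, r_i = 100, 1, 10, 0
w_half = term_weight_adjusted(N, R, n_i, r_i)              # finite
w_alt = term_weight_adjusted(N, R, n_i, r_i, n_i / N)      # finite
try:
    term_weight_unadjusted(N, R, n_i, r_i)
    unadjusted_failed = False
except (ValueError, ZeroDivisionError):
    # log(0) or a division by zero, depending on the counts
    unadjusted_failed = True
```

With these counts the raw estimate gives p_i = 0, so the unadjusted weight cannot be computed, while both adjusted variants stay finite.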

Term Reweighting for the Probabilistic Model (Cont.)

• Characteristics
– Advantage
  • the term reweighting is optimal under the assumptions of
    – term independence
    – binary document indexing ($w_{i,q} \in \{0,1\}$ and $w_{i,j} \in \{0,1\}$)
– Disadvantage
  • no query expansion is used
  • weights of terms in the previous query formulations are also disregarded
  • document term weights are not taken into account during the feedback loop

Term Reweighting for the Probabilistic Model (Cont.)

Evaluation of relevance feedback

• The standard evaluation method (i.e., recall-precision) is not suitable
– because the relevant documents used to reweight the query terms are moved to higher ranks
• The residual collection method
– the residual collection is the set of all documents minus the set of feedback documents provided by the user
– because highly ranked documents are removed from the collection, the recall-precision figures for the modified query $q_m$ tend to be lower than the figures for the original query $q$
– as a basic rule of thumb, any experimentation involving relevance feedback strategies should always evaluate recall-precision figures relative to the residual collection
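A small sketch of evaluating on the residual collection: the documents the user judged during feedback are removed from both the ranking and the relevant set before precision is measured. The helper names, document identifiers, and judgements below are all made up for illustration.

```python
def residual(ranking, feedback_docs):
    """Drop the feedback documents from a ranked list, keeping the order."""
    seen = set(feedback_docs)
    return [d for d in ranking if d not in seen]

def precision_at(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)

# Hypothetical ranking produced by the modified query
ranking = ["d1", "d2", "d3", "d4", "d5"]
feedback = ["d1", "d2"]            # documents shown to and judged by the user
relevant = {"d1", "d3", "d5"}      # full set of relevant documents

residual_ranking = residual(ranking, feedback)
residual_relevant = relevant - set(feedback)
p2 = precision_at(residual_ranking, residual_relevant, 2)
```

For a fair comparison, the ranking of the original query would be reduced in the same way before its figures are computed.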

Automatic Strategies

• In relevance feedback, the user separates the documents into two classes: relevant vs. non-relevant
– An underlying notion of clustering supports the feedback strategy
– Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents
– This can be done automatically

Automatic Strategies

• Two types of strategies
– Global
  • All documents in the collection are used to determine a global thesaurus-like structure which defines term relationships
– Local
  • The documents retrieved for a given query are examined at query time to determine terms for query expansion
  • Local clustering (Attar and Fraenkel, 1977)
  • Local context analysis (Xu and Croft, 1996)

Automatic Local Analysis

• Definition
– local document set $D_l$: the set of documents retrieved by a query
– local vocabulary $V_l$: the set of all distinct words in $D_l$
– stemmed vocabulary $S_l$: the set of all distinct stems derived from $V_l$
• Local feedback strategies are based on expanding the query with terms correlated to the query terms
– Such terms are those present in local clusters built from the local document set
• Building local clusters
– association clusters
– metric clusters
– scalar clusters

Association Clusters

• Idea
– co-occurrence of stems (or terms) inside documents
• $f_{u,j}$: the frequency of a stem $k_u$ in a document $d_j$
– correlation between two stems

$$c_{u,v} = c(k_u, k_v) = \sum_{j=1}^{|D_l|} f_{u,j} \times f_{v,j}$$

– local association cluster for a stem $k_u$
  • the set of k largest values in $c(k_u, k_v)$
– given a query $q$, find clusters for the $|q|$ query terms
– normalized form

$$s(k_u, k_v) = \frac{c(k_u, k_v)}{c(k_u, k_u) + c(k_v, k_v) - c(k_u, k_v)}$$
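A minimal sketch of association clusters, assuming the stem frequencies in the local document set are already available as a dictionary from stem to a list of per-document frequencies. The function names, the stems, and the counts are invented for this example.

```python
def association_matrix(freqs):
    """c[u, v] = sum over local documents of f_{u,j} * f_{v,j}."""
    stems = list(freqs)
    return {
        (u, v): sum(fu * fv for fu, fv in zip(freqs[u], freqs[v]))
        for u in stems
        for v in stems
    }

def normalized_association(c, u, v):
    """s(k_u, k_v) = c_uv / (c_uu + c_vv - c_uv)."""
    denom = c[u, u] + c[v, v] - c[u, v]
    return c[u, v] / denom if denom else 0.0

def association_cluster(c, stems, u, k, normalize=False):
    """The k stems most strongly associated with stem u."""
    if normalize:
        score = lambda v: normalized_association(c, u, v)
    else:
        score = lambda v: c[u, v]
    others = [v for v in stems if v != u]
    return sorted(others, key=score, reverse=True)[:k]

# Made-up frequencies of three stems in three local documents
freqs = {
    "retriev": [2, 1, 0],
    "inform": [1, 1, 0],
    "cluster": [0, 0, 3],
}
c = association_matrix(freqs)
neighbours = association_cluster(c, list(freqs), "retriev", k=1)
```

For query expansion, the stems returned for each of the query stems would be added to the query.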

Metric Clusters

• Idea
– consider the distance between two terms in the same cluster
• Definition
– $V(k_u)$: the set of keywords which have the same stem form as $k_u$
– distance $r(k_i, k_j)$ = the number of words between keywords $k_i$ and $k_j$

$$c(k_u, k_v) = \sum_{k_i \in V(k_u)} \sum_{k_j \in V(k_v)} \frac{1}{r(k_i, k_j)}$$

– normalized form

$$s(k_u, k_v) = \frac{c(k_u, k_v)}{|V(k_u)| \times |V(k_v)|}$$
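The metric correlation can be sketched as follows. Several points here are assumptions rather than something the slides specify: documents are lists of tokens, `stem_of` is a hypothetical keyword-to-stem mapping, every pair of occurrences in the same document contributes to the sum, and the distance between two occurrences is taken as the difference of their word positions (so adjacent words have distance 1 and no division by zero occurs).

```python
from collections import defaultdict

def metric_correlations(docs, stem_of):
    """Return (c, variants) for the local document set.

    c[u, v] accumulates 1 / distance over pairs of occurrences whose
    stems are u and v; variants[u] is V(k_u), the keywords with stem u.
    """
    c = defaultdict(float)
    variants = defaultdict(set)
    for doc in docs:
        occurrences = [(pos, stem_of[w]) for pos, w in enumerate(doc) if w in stem_of]
        for w in doc:
            if w in stem_of:
                variants[stem_of[w]].add(w)
        for a, (pos_i, stem_u) in enumerate(occurrences):
            for pos_j, stem_v in occurrences[a + 1:]:
                if stem_u != stem_v:
                    inverse_distance = 1.0 / (pos_j - pos_i)
                    c[stem_u, stem_v] += inverse_distance
                    c[stem_v, stem_u] += inverse_distance
    return c, variants

def normalized_metric(c, variants, u, v):
    """s(k_u, k_v) = c(k_u, k_v) / (|V(k_u)| * |V(k_v)|)."""
    size = len(variants.get(u, ())) * len(variants.get(v, ()))
    return c.get((u, v), 0.0) / size if size else 0.0

# Made-up one-document local set and stem table
docs = [["query", "expansion", "improves", "queries"]]
stem_of = {"query": "queri", "queries": "queri", "expansion": "expans"}
c, variants = metric_correlations(docs, stem_of)
s = normalized_metric(c, variants, "queri", "expans")
```

Keywords that never occur in the same document contribute nothing, which corresponds to treating their distance as infinite.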

Scalar Clusters

• Idea
– two stems with similar neighborhoods have some synonymity relationships
• Definition
– $c_{u,v} = c(k_u, k_v)$
– vectors of correlation values for stems $k_u$ and $k_v$

$$\vec{s}_u = (c_{u,1}, c_{u,2}, \ldots, c_{u,t}), \qquad \vec{s}_v = (c_{v,1}, c_{v,2}, \ldots, c_{v,t})$$

– scalar association matrix

$$S_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$$

– scalar clusters
  • the set of k largest values of scalar association
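The scalar association between two stems is the cosine of their correlation vectors, so it can be computed from any correlation matrix (association or metric). The sketch below assumes the matrix is given as a dictionary keyed by stem pairs; the function name and the matrix values are made up for illustration.

```python
import math

def scalar_association(c, stems, u, v):
    """Cosine between the correlation vectors of stems u and v."""
    s_u = [c[u, x] for x in stems]
    s_v = [c[v, x] for x in stems]
    dot = sum(a * b for a, b in zip(s_u, s_v))
    norm = math.sqrt(sum(a * a for a in s_u)) * math.sqrt(sum(b * b for b in s_v))
    return dot / norm if norm else 0.0

# Made-up symmetric correlation matrix over three stems
stems = ["a", "b", "c"]
c = {
    ("a", "a"): 4, ("a", "b"): 2, ("a", "c"): 0,
    ("b", "a"): 2, ("b", "b"): 1, ("b", "c"): 0,
    ("c", "a"): 0, ("c", "b"): 0, ("c", "c"): 3,
}
s_ab = scalar_association(c, stems, "a", "b")   # parallel vectors
s_ac = scalar_association(c, stems, "a", "c")   # orthogonal vectors
```

Stems "a" and "b" have proportional correlation vectors, so they count as having the same neighborhood even though their direct correlation is modest.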

Automatic Global Analysis

• A thesaurus-like structure
• Short history
– Until the beginning of the 1990s, global analysis was considered to be a technique which failed to yield consistent improvements in retrieval performance with general collections
– This perception has changed with the appearance of modern procedures for global analysis

Query Expansion based on a Similarity Thesaurus

• Idea by Qiu and Frei [1993]
– Similarity thesaurus is based on term to term relationships rather than on a matrix of co-occurrence
– Terms for expansion are selected based on their similarity to the whole query rather than on their similarities to individual query terms
• Definition
– $N$: total number of documents in the collection
– $t$: total number of terms in the collection
– $tf_{i,j}$: occurrence frequency of term $k_i$ in the document $d_j$
– $t_j$: the number of distinct index terms in the document $d_j$
– $itf_j$: the inverse term frequency for document $d_j$

$$itf_j = \log \frac{t}{t_j}$$

Similarity Thesaurus

• Each term is associated with a vector

$$\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$$

– where $w_{i,j}$ is a weight associated to the index-document pair

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_l \{tf_{i,l}\}}\right) itf_j}{\sqrt{\sum_{k=1}^{N} \left(0.5 + 0.5\,\frac{tf_{i,k}}{\max_l \{tf_{i,l}\}}\right)^2 itf_k^2}}$$

• The relationship between two terms $k_u$ and $k_v$ is

$$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{j=1}^{N} w_{u,j} \times w_{v,j}$$

– Note that this is a variation of the correlation measure used for computing scalar association matrices

Term weighting vs. Term concept space

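[Figure: two panels, each linking Term $k_i$ and Doc $d_j$ through $tf_{i,j}$.]

• Term weighting (a document is a vector over the $t$ terms)

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_l \{tf_{l,j}\}}\right) idf_i}{\sqrt{\sum_{k=1}^{t} \left(0.5 + 0.5\,\frac{tf_{k,j}}{\max_l \{tf_{l,j}\}}\right)^2 idf_k^2}}$$

• Term concept space (a term is a vector over the $N$ documents)

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{tf_{i,j}}{\max_l \{tf_{i,l}\}}\right) itf_j}{\sqrt{\sum_{k=1}^{N} \left(0.5 + 0.5\,\frac{tf_{i,k}}{\max_l \{tf_{i,l}\}}\right)^2 itf_k^2}}$$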

Query Expansion Procedure with Similarity Thesaurus

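1. Represent the query in the concept space by using the representation of the index terms

$$\vec{q} = \sum_{k_u \in q} w_{u,q}\, \vec{k}_u$$

2. Compute the similarity $sim(q, k_v)$ between each term $k_v$ and the whole query

$$sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q}\, \vec{k}_u \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}$$

3. Expand the query with the top r ranked terms according to $sim(q, k_v)$; each added term $k_v$ receives the weight

$$w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}$$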
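The three steps of the expansion procedure can be sketched as below, assuming the term vectors of the similarity thesaurus have already been computed. The function names, the term vectors, and the query weights are made up for this example.

```python
def term_correlation(k_u, k_v):
    """c_{u,v}: inner product of two term vectors over the N documents."""
    return sum(a * b for a, b in zip(k_u, k_v))

def expand_query(query_weights, term_vectors, r):
    """Add the r terms most similar to the whole query.

    query_weights maps each query term to its weight w_{u,q};
    term_vectors maps each term in the thesaurus to its vector.
    """
    total = sum(query_weights.values())
    sims = {}
    for v, k_v in term_vectors.items():
        if v in query_weights:
            continue
        # sim(q, k_v) = sum over query terms of w_{u,q} * c_{u,v}
        sims[v] = sum(
            w * term_correlation(term_vectors[u], k_v)
            for u, w in query_weights.items()
        )
    top = sorted(sims, key=sims.get, reverse=True)[:r]
    expanded = dict(query_weights)
    for v in top:
        # Weight of an added term: sim(q, k_v) / sum of the query weights
        expanded[v] = sims[v] / total
    return expanded

# Made-up term vectors over two documents
term_vectors = {
    "car": [1.0, 0.0],
    "automobile": [1.0, 0.0],
    "banana": [0.0, 1.0],
}
expanded = expand_query({"car": 1.0}, term_vectors, r=1)
```

Because similarity is measured against the whole query, a candidate term is ranked by its weighted correlation with all query terms at once rather than with any single one.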

Example of Similarity Thesaurus

The distance of a given term kv to the query centroid QC might be quite distinct from the distances of kv to the individual query terms

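[Figure: terms $k_a$, $k_b$, $k_i$, $k_j$, $k_v$ plotted in the concept space together with the query centroid QC, where QC = {$k_a$, $k_b$}.]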

Query Expansion based on a Similarity Thesaurus

– A document $d_j$ is represented in the term-concept space by

$$\vec{d}_j = \sum_{k_v \in d_j} w_{v,j}\, \vec{k}_v$$

– If the original query $q$ is expanded to include all the $t$ index terms, then the similarity $sim(q, d_j)$ between the document $d_j$ and the query $q$ can be computed as

$$sim(q, d_j) \propto \sum_{k_u \in q} w_{u,q}\, \vec{k}_u \cdot \sum_{k_v \in d_j} w_{v,j}\, \vec{k}_v = \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}$$

• which is similar to the generalized vector space model

Query Expansion based on a Statistical Thesaurus

• Idea by Crouch and Yang (1992)
– Use complete link algorithm to produce small and tight clusters
– Use term discrimination value to select terms for entry into a particular thesaurus class
• Term discrimination value
– A measure of the change in space separation which occurs when a given term is assigned to the document collection

Term Discrimination Value

• Terms
– good discriminators (terms with positive discrimination values)
  • index terms
– indifferent discriminators (near-zero discrimination values)
  • thesaurus class
– poor discriminators (negative discrimination values)
  • term phrases
• Document frequency $df_k$ ($n$: the number of documents in the collection)
– $df_k > n/10$: high frequency term (poor discriminator)
– $df_k < n/100$: low frequency term (indifferent discriminator)
– $n/100 \le df_k \le n/10$: good discriminator
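The document frequency thresholds above translate directly into a small classifier. This is only a sketch: the function name and the returned labels are invented for the example, and the mapping from each class to its use (term phrases, thesaurus class, index terms) follows the list of discriminator types.

```python
def classify_by_document_frequency(df_k, n):
    """Classify a term by its document frequency df_k in n documents."""
    if df_k > n / 10:
        return "poor"          # high frequency: combine into term phrases
    if df_k < n / 100:
        return "indifferent"   # low frequency: group into a thesaurus class
    return "good"              # medium frequency: use directly as index term

# Made-up collection of 1000 documents
labels = {df: classify_by_document_frequency(df, 1000) for df in (5, 50, 500)}
```

Only the terms labelled "indifferent" are candidates for thesaurus classes in the statistical thesaurus approach.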

Statistical Thesaurus

• Term discrimination value theory
– the terms which make up a thesaurus class must be indifferent discriminators
• The proposed approach
– cluster the document collection into small, tight clusters
– A thesaurus class is defined as the intersection of all the low frequency terms in that cluster
– documents are indexed by the thesaurus classes
– the thesaurus classes are weighted by

$$wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}$$

  where $|C|$ is the number of terms in thesaurus class $C$ and $w_{i,C}$ is the weight of the ith term of $C$

Discussion

• Query expansion
– useful
– little explored technique
• Trends and research issues
– The combination of local analysis, global analysis, visual displays, and interactive interfaces is also a current and important research problem