Indexing and Representation: The Vector Space Model
Posted 21-Dec-2015
A document is represented by a vector of terms:
- Words (or word stems)
- Phrases (e.g., "computer science")
- Words on a "stop list" are removed -- documents aren't about "the"

Terms are often assumed to be uncorrelated; correlations between term vectors then imply a similarity between documents. For efficiency, an inverted index of terms is often stored.
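The inverted index mentioned above can be sketched in a few lines of Python (the corpus and document ids here are made up for illustration):

```python
from collections import defaultdict

# Toy corpus: document id -> text (both hypothetical).
docs = {
    "d1": "nova galaxy heat",
    "d2": "hollywood film role",
    "d3": "nova film",
}

# Inverted index: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Look up every document containing "nova" without scanning the corpus.
print(sorted(index["nova"]))  # ['d1', 'd3']
```

The point of the structure is that a query term maps straight to its posting set, so only documents sharing at least one term with the query are ever touched.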
Document Representation: What values to use for terms?
- Boolean (term present / absent)
- tf (term frequency): count of the times the term occurs in the document. The more times a term t occurs in document d, the more likely it is that t is relevant to the document. Used alone, it favors common words and long documents.
- df (document frequency): the more a term t occurs throughout all documents, the more poorly t discriminates between documents.
- tf-idf (term frequency * inverse document frequency): a high value indicates that the word occurs more often in this document than average.
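As a concrete illustration of tf versus df (toy documents, hypothetical text):

```python
from collections import Counter

# Toy collection. tf counts occurrences of a term within one document;
# df counts how many documents contain the term at all.
docs = [
    "the nova was a bright nova",
    "the film about the nova",
    "a film role",
]

def tf(term, doc):
    return Counter(doc.split())[term]

def df(term, docs):
    return sum(1 for d in docs if term in d.split())

print(tf("nova", docs[0]))  # 2  (occurs twice in the first document)
print(df("nova", docs))     # 2  (appears in two of the three documents)
```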
Vector Representation

Documents and queries are represented as vectors. Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t:

    D_i = (w_{d_i1}, w_{d_i2}, ..., w_{d_it})
    Q   = (w_{q1},   w_{q2},   ..., w_{qt})

    w = 0 if a term is absent
Document Vectors

Each row is a document (ids A-I); columns are term weights:

    id | nova  galaxy  heat  h'wood  film  role  diet  fur
    A  | 1.0   0.5     0.3
    B  | 0.5   1.0
    C  |                     1.0     0.8   0.7
    D  |                     0.9     1.0   0.5
    E  |                                               1.0   1.0
    F  |                                               0.9   1.0
    G  |                             0.5   0.7   0.9
    H  | 0.6   1.0     0.3   0.2                             0.8
    I  | 0.7   0.5                   0.1   0.3
Assigning Weights

We want to weight terms highly if they are:
- frequent in relevant documents ... BUT
- infrequent in the collection as a whole

The tf x idf measure combines:
- term frequency (tf)
- inverse document frequency (idf)
    tf_ik = frequency of term T_k in document D_i
    idf_k = inverse document frequency of term T_k in collection C

    idf_k = log(N / n_k)

    N   = total number of documents in the collection C
    n_k = number of documents in C that contain T_k
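The weight before normalization is just the product of the two factors above. A small helper, noting that the log base is a convention (natural log is assumed in this sketch):

```python
import math

# w = tf * log(N / n_k), per the definitions above.
def tf_idf_weight(tf_ik, N, n_k):
    return tf_ik * math.log(N / n_k)

# A term occurring 3 times in a document, found in 10 of 1000 documents:
print(round(tf_idf_weight(3, 1000, 10), 2))  # 13.82
# A term that appears in every document gets weight 0 -- it cannot
# discriminate between documents:
print(tf_idf_weight(3, 1000, 1000))  # 0.0
```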
tf x idf

Normalize the term weights (so longer documents are not unfairly given more weight):

    w_ik = tf_ik * log(N / n_k) / sqrt( sum_{k=1}^{t} (tf_ik)^2 * [log(N / n_k)]^2 )

Now:

    sim(D_i, D_j) = sum_{k=1}^{t} w_ik * w_jk
tf x idf normalization

Normalize the term weights (so longer documents are not unfairly given more weight). To "normalize" usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive:

    w_ik = tf_ik * log(N / n_k) / sqrt( sum_{k=1}^{t} (tf_ik)^2 * [log(N / n_k)]^2 )
Vector Space Similarity Measure

Combine tf x idf into a similarity measure:

    D_i = (w_{d_i1}, w_{d_i2}, ..., w_{d_it})
    Q   = (w_{q1},   w_{q2},   ..., w_{qt})

    w = 0 if a term is absent

    unnormalized similarity:

        sim(Q, D_i) = sum_{j=1}^{t} w_{qj} * w_{d_ij}

    cosine (cosine is the normalized inner product):

        sim(Q, D_i) = sum_{j=1}^{t} w_{qj} * w_{d_ij}
                      / sqrt( sum_{j=1}^{t} (w_{qj})^2 * sum_{j=1}^{t} (w_{d_ij})^2 )
Computing Similarity Scores

[Figure: a two-dimensional term space (axes from 0 to 1.0) showing the vectors D1 = (0.8, 0.3), D2 = (0.2, 0.7), and Q = (0.4, 0.8); the angle between Q and D2 gives cos(theta_2) = 0.98, and the angle between Q and D1 gives cos(theta_1) = 0.74.]
Documents in Vector Space

[Figure: documents D1-D11 plotted as vectors in a three-dimensional term space with axes t1, t2, and t3.]
Computing a similarity score

Say we have the query vector Q = (0.4, 0.8) and also the document D2 = (0.2, 0.7). What does their similarity comparison yield?

    sim(Q, D2) = (0.4 * 0.2) + (0.8 * 0.7)
                 / sqrt( [(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2] )
               = 0.64 / sqrt(0.42)
               = 0.98
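A quick check of this arithmetic (and of the cos values in the earlier figure) in Python:

```python
import math

def cosine(q, d):
    # Inner product divided by the product of the two vector lengths.
    dot = sum(wq * wd for wq, wd in zip(q, d))
    nq = math.sqrt(sum(w * w for w in q))
    nd = math.sqrt(sum(w * w for w in d))
    return dot / (nq * nd)

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cosine(Q, D2), 2))  # 0.98
print(round(cosine(Q, D1), 2))  # 0.73 (the figure's 0.74 reflects coarser rounding)
```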
Similarity Measures

With Q and D viewed as sets of terms:

    Simple matching (coordination level match):  |Q ∩ D|

    Dice's Coefficient:     2|Q ∩ D| / (|Q| + |D|)

    Jaccard's Coefficient:  |Q ∩ D| / |Q ∪ D|

    Cosine Coefficient:     |Q ∩ D| / (|Q|^(1/2) * |D|^(1/2))

    Overlap Coefficient:    |Q ∩ D| / min(|Q|, |D|)
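Treating Q and D as plain term sets, the coefficients above can be computed directly (the example sets are hypothetical):

```python
import math

# Set-based similarity coefficients; Q and D are sets of terms.
def dice(Q, D):
    return 2 * len(Q & D) / (len(Q) + len(D))

def jaccard(Q, D):
    return len(Q & D) / len(Q | D)

def cosine_coeff(Q, D):
    return len(Q & D) / math.sqrt(len(Q) * len(D))

def overlap(Q, D):
    return len(Q & D) / min(len(Q), len(D))

Q = {"nova", "film", "role"}
D = {"film", "role", "diet", "fur"}
print(len(Q & D))               # 2  (simple matching)
print(round(dice(Q, D), 2))     # 0.57
print(round(jaccard(Q, D), 2))  # 0.4
print(round(overlap(Q, D), 2))  # 0.67
```

All four normalize the raw overlap |Q ∩ D| differently, which is why they rank borderline documents differently even on the same data.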
Problems with Vector Space

- There is no real theoretical basis for the assumption of a term space: it is more for visualization than having any real basis, and most similarity measures work about the same regardless of model.
- Terms are not really orthogonal dimensions: terms are not independent of all other terms.
Probabilistic Models

- A rigorous formal model attempts to predict the probability that a given document will be relevant to a given query.
- Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle).
- Relies on accurate estimates of probabilities for accurate results.
Probabilistic Retrieval

- Goes back to the 1960's (Maron and Kuhns).
- Robertson's "Probabilistic Ranking Principle": retrieved documents should be ranked in decreasing probability that they are relevant to the user's query.
- How to estimate these probabilities? Several methods (Model 1, Model 2, Model 3) with different emphases on how the estimates are done.
Probabilistic Models: Some Notation

- D = all present and future documents
- Q = all present and future queries
- (D_i, Q_j) = a document-query pair
- x = class of similar documents, x ⊆ D
- y = class of similar queries, y ⊆ Q

Relevance is a relation:

    R = {(D_i, Q_j) | D_i ∈ D, Q_j ∈ Q, document D_i is judged relevant by the user submitting query Q_j}
Probabilistic Models: Logistic Regression

The probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained by:

    P(R | Q, D_i) = c_0 + sum_{i=1}^{6} c_i * X_i

for the six X attribute measures shown next.
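A sketch of how such an estimate could be computed at retrieval time. The coefficients and attribute values below are invented for illustration; in the model described above the c_i come from fitting the regression on a judged sample, and the linear combination (the log-odds) is mapped through the logistic function so the result is a probability:

```python
import math

def relevance_probability(X, c):
    # c[0] is the intercept; c[1:] pairs with the attribute values X.
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], X))
    # Logistic function maps log-odds into [0, 1].
    return 1 / (1 + math.exp(-log_odds))

c = [-3.5, 0.4, -0.1, 0.3, -0.2, 0.5, 1.0]  # hypothetical coefficients
X = [1.2, 3.0, 2.1, 5.5, 2.8, 1.1]          # hypothetical attribute values

p = relevance_probability(X, c)
print(0.0 <= p <= 1.0)  # True
```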
Probabilistic Models: Logistic Regression attributes

    X_1 = (1/M) * sum_{j=1}^{M} log QAF_{t_j}    Average Absolute Query Frequency
    X_2 = sqrt(QL)                               Query Length
    X_3 = (1/M) * sum_{j=1}^{M} log DAF_{t_j}    Average Absolute Document Frequency
    X_4 = sqrt(DL)                               Document Length
    X_5 = (1/M) * sum_{j=1}^{M} log IDF_{t_j}    Average Inverse Document Frequency
          where IDF_t = N / n_t                  Inverse Document Frequency
    X_6 = log M                                  Number of terms in common between query and document -- logged
Probabilistic Models

Advantages:
- Strong theoretical basis.
- In principle should supply the best predictions of relevance given the available information.
- Can be implemented similarly to Vector.

Disadvantages:
- Relevance information is required -- or is "guestimated".
- Important indicators of relevance may not be terms -- though terms only are usually used.
- Optimally requires on-going collection of relevance information.
Vector and Probabilistic Models

Both:
- Support "natural language" queries.
- Treat documents and queries the same.
- Support relevance feedback searching.
- Support ranked retrieval.

They differ primarily in theoretical basis and in how the ranking is calculated:
- Vector assumes relevance.
- Probabilistic relies on relevance judgments or estimates.
Simple Presentation of Results

- Order by similarity: decreasing order of presumed relevance.
- Items retrieved early in the search may help generate feedback via relevance feedback.
- Select the top k documents, or select documents within ε of the query.
Evaluation

- Relevance
- Evaluation of IR systems
- Precision vs. recall
- Cutoff points
- Test collections / TREC
- Blair & Maron study
What to Evaluate?

- How much is learned about the collection?
- How much is learned about a topic?
- How much of the information need is satisfied?
- How inviting is the system?
What to Evaluate?

What can be measured that reflects users' ability to use the system? (Cleverdon 66)
- Coverage of information
- Form of presentation
- Effort required / ease of use
- Time and space efficiency
- Recall: proportion of relevant material actually retrieved
- Precision: proportion of retrieved material actually relevant

Recall and precision together measure effectiveness.
Relevance

In what ways can a document be relevant to a query?
- Answer a precise question precisely.
- Partially answer a question.
- Suggest a source for more information.
- Give background information.
- Remind the user of other knowledge.
- Others ...
Standard IR Evaluation

    Precision = (# relevant retrieved) / (# retrieved)

    Recall = (# relevant retrieved) / (# relevant in collection)

[Figure: the collection drawn as a large set containing the relevant documents and the retrieved documents as overlapping subsets.]
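The two ratios translate directly into code (the counts below are hypothetical):

```python
def precision(n_relevant_retrieved, n_retrieved):
    # Fraction of what was retrieved that is relevant.
    return n_relevant_retrieved / n_retrieved

def recall(n_relevant_retrieved, n_relevant_in_collection):
    # Fraction of all relevant documents that were retrieved.
    return n_relevant_retrieved / n_relevant_in_collection

# Say the system retrieves 20 documents, 8 of which are relevant,
# out of 40 relevant documents in the whole collection:
print(precision(8, 20))  # 0.4
print(recall(8, 40))     # 0.2
```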
Precision/Recall Curves

There is a tradeoff between precision and recall, so measure precision at different levels of recall.

[Figure: a precision vs. recall plot with measured points marked "x".]
Precision/Recall Curves

It is difficult to determine which of these two hypothetical results is better:

[Figure: two overlapping precision vs. recall curves, each with points marked "x"; one is higher at low recall, the other at high recall.]
Document Cutoff Levels

Another way to evaluate:
- Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500.
- Measure precision at each of these levels.
- Take the (weighted) average over results.

This is a way to focus on high precision.
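A minimal sketch of precision at fixed cutoffs, assuming a list of hypothetical relevance judgments for one ranked result list:

```python
# rels[i] is True if the i-th retrieved document is relevant
# (judgments below are invented for illustration).
rels = [True, False, True, True, False, True, False, False, True, False]

def precision_at(k, rels):
    # Precision over just the top k results.
    return sum(rels[:k]) / k

for k in (5, 10):
    print(k, precision_at(k, rels))  # 5 0.6, then 10 0.5
```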
The E-Measure

Combine precision and recall into one number (van Rijsbergen 79):

    E = 1 - (1 + b^2) * P * R / (b^2 * P + R)

    P = precision
    R = recall
    b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.
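The E-measure as a function (example values are illustrative):

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E: 0 is best, 1 is worst.
    b < 1 weights precision more heavily; b > 1 weights recall."""
    return 1 - ((1 + b**2) * precision * recall) / (b**2 * precision + recall)

# With b = 1 this is 1 minus the harmonic mean of P and R:
print(round(e_measure(0.5, 0.5), 2))  # 0.5
# Perfect precision and recall give the best score, 0:
print(e_measure(1.0, 1.0, b=0.5))  # 0.0
```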
TREC: Text REtrieval Conference/Competition

- Run by NIST (National Institute of Standards & Technology); 1997 was the 6th year.
- Collection: 3 gigabytes, > 1 million documents.
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments.
  - Queries devised and judged by "information specialists".
  - Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition.
  - Various research and commercial groups compete.
  - Results judged on precision and recall, going up to a recall level of 1000 documents.
Sample TREC queries (topics)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative:
A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
TREC

Benefits:
- Made research systems scale to large collections (pre-WWW).
- Allows for somewhat controlled comparisons.

Drawbacks:
- Emphasis on high recall, which may be unrealistic for what most users want.
- Very long queries, also unrealistic.
- Comparisons still difficult to make, because systems are quite different on many dimensions.
- Focus on batch ranking rather than interaction.
- No focus on the WWW.
TREC Results

- Results differ each year.
- For the main track:
  - The best systems are not statistically significantly different.
  - Small differences sometimes have big effects:
    - how good was the hyphenation model
    - how was document length taken into account
  - Systems were optimized for longer queries and all performed worse for shorter, more realistic queries.
- The excitement is in the new tracks:
  - Interactive
  - Multilingual
  - NLP
Blair and Maron 1985

- A highly influential paper: a classic study of retrieval effectiveness (earlier studies used unrealistically small collections).
- Studied an archive of documents for a legal suit:
  - ~350,000 pages of text
  - 40 queries
  - focus on high recall
- Used IBM's STAIRS full-text system.
- Main result: the system retrieved less than 20% of the relevant documents for a particular information need when the lawyers thought they had 75%.
- But many queries had very high precision.
Blair and Maron, cont.

Why recall was low:
- Users can't foresee the exact words and phrases that will indicate relevant documents:
  - "accident" referred to by those responsible as "event," "incident," "situation," "problem," ...
  - differing technical terminology
  - slang, misspellings
- Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied.