Indexing and Representation: The Vector Space Model
Posted 21-Dec-2015
A document is represented by a vector of terms:
- Words (or word stems)
- Phrases (e.g., "computer science")
- Words on a "stop list" are removed -- documents aren't about "the"

Terms are often assumed to be uncorrelated; correlations between term vectors then imply a similarity between documents. For efficiency, an inverted index of terms is often stored.
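The inverted index mentioned above can be sketched in a few lines of Python (the corpus and document ids here are made up for illustration):

```python
from collections import defaultdict

# Toy corpus: document id -> text (both hypothetical).
docs = {
    "d1": "nova galaxy heat",
    "d2": "hollywood film role",
    "d3": "nova film",
}

# Inverted index: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Look up every document containing "nova" without scanning the corpus.
print(sorted(index["nova"]))  # ['d1', 'd3']
```

The point of the structure is that a query term maps straight to its posting set, so only documents sharing at least one term with the query are ever touched.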
Document Representation: What values to use for terms?
- Boolean (term present / absent)
- tf (term frequency): count of the times the term occurs in the document. The more times a term t occurs in document d, the more likely it is that t is relevant to the document. Used alone, it favors common words and long documents.
- df (document frequency): the more a term t occurs throughout all documents, the more poorly t discriminates between documents.
- tf-idf (term frequency * inverse document frequency): a high value indicates that the word occurs more often in this document than average.
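As a concrete illustration of tf versus df (toy documents, hypothetical text):

```python
from collections import Counter

# Toy collection. tf counts occurrences of a term within one document;
# df counts how many documents contain the term at all.
docs = [
    "the nova was a bright nova",
    "the film about the nova",
    "a film role",
]

def tf(term, doc):
    return Counter(doc.split())[term]

def df(term, docs):
    return sum(1 for d in docs if term in d.split())

print(tf("nova", docs[0]))  # 2  (occurs twice in the first document)
print(df("nova", docs))     # 2  (appears in two of the three documents)
```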
Vector Representation

Documents and queries are represented as vectors. Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t:

    D_i = (w_{d_i1}, w_{d_i2}, ..., w_{d_it})
    Q   = (w_{q1},   w_{q2},   ..., w_{qt})

    w = 0 if a term is absent
Document Vectors

Each row is a document (ids A-I); columns are term weights:

    id | nova  galaxy  heat  h'wood  film  role  diet  fur
    A  | 1.0   0.5     0.3
    B  | 0.5   1.0
    C  |                     1.0     0.8   0.7
    D  |                     0.9     1.0   0.5
    E  |                                               1.0   1.0
    F  |                                               0.9   1.0
    G  |                             0.5   0.7   0.9
    H  | 0.6   1.0     0.3   0.2                             0.8
    I  | 0.7   0.5                   0.1   0.3
Assigning Weights

We want to weight terms highly if they are:
- frequent in relevant documents ... BUT
- infrequent in the collection as a whole

The tf x idf measure combines:
- term frequency (tf)
- inverse document frequency (idf)
    tf_ik = frequency of term T_k in document D_i
    idf_k = inverse document frequency of term T_k in collection C

    idf_k = log(N / n_k)

    N   = total number of documents in the collection C
    n_k = number of documents in C that contain T_k
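The weight before normalization is just the product of the two factors above. A small helper, noting that the log base is a convention (natural log is assumed in this sketch):

```python
import math

# w = tf * log(N / n_k), per the definitions above.
def tf_idf_weight(tf_ik, N, n_k):
    return tf_ik * math.log(N / n_k)

# A term occurring 3 times in a document, found in 10 of 1000 documents:
print(round(tf_idf_weight(3, 1000, 10), 2))  # 13.82
# A term that appears in every document gets weight 0 -- it cannot
# discriminate between documents:
print(tf_idf_weight(3, 1000, 1000))  # 0.0
```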
tf x idf

Normalize the term weights (so longer documents are not unfairly given more weight):

    w_ik = tf_ik * log(N / n_k) / sqrt( sum_{k=1}^{t} (tf_ik)^2 * [log(N / n_k)]^2 )

Now:

    sim(D_i, D_j) = sum_{k=1}^{t} w_ik * w_jk
tf x idf normalization

Normalize the term weights (so longer documents are not unfairly given more weight). To "normalize" usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive:

    w_ik = tf_ik * log(N / n_k) / sqrt( sum_{k=1}^{t} (tf_ik)^2 * [log(N / n_k)]^2 )
Vector Space Similarity Measure

Combine tf x idf into a similarity measure:

    D_i = (w_{d_i1}, w_{d_i2}, ..., w_{d_it})
    Q   = (w_{q1},   w_{q2},   ..., w_{qt})

    w = 0 if a term is absent

    unnormalized similarity:

        sim(Q, D_i) = sum_{j=1}^{t} w_{qj} * w_{d_ij}

    cosine (cosine is the normalized inner product):

        sim(Q, D_i) = sum_{j=1}^{t} w_{qj} * w_{d_ij}
                      / sqrt( sum_{j=1}^{t} (w_{qj})^2 * sum_{j=1}^{t} (w_{d_ij})^2 )
Computing Similarity Scores

[Figure: a two-dimensional term space (axes from 0 to 1.0) showing the vectors D1 = (0.8, 0.3), D2 = (0.2, 0.7), and Q = (0.4, 0.8); the angle between Q and D2 gives cos(theta_2) = 0.98, and the angle between Q and D1 gives cos(theta_1) = 0.74.]
Documents in Vector Space

[Figure: documents D1-D11 plotted as vectors in a three-dimensional term space with axes t1, t2, and t3.]
Computing a similarity score

Say we have the query vector Q = (0.4, 0.8) and also the document D2 = (0.2, 0.7). What does their similarity comparison yield?

    sim(Q, D2) = (0.4 * 0.2) + (0.8 * 0.7)
                 / sqrt( [(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2] )
               = 0.64 / sqrt(0.42)
               = 0.98
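A quick check of this arithmetic (and of the cos values in the earlier figure) in Python:

```python
import math

def cosine(q, d):
    # Inner product divided by the product of the two vector lengths.
    dot = sum(wq * wd for wq, wd in zip(q, d))
    nq = math.sqrt(sum(w * w for w in q))
    nd = math.sqrt(sum(w * w for w in d))
    return dot / (nq * nd)

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cosine(Q, D2), 2))  # 0.98
print(round(cosine(Q, D1), 2))  # 0.73 (the figure's 0.74 reflects coarser rounding)
```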
Similarity Measures

With Q and D viewed as sets of terms:

    Simple matching (coordination level match):  |Q ∩ D|

    Dice's Coefficient:     2|Q ∩ D| / (|Q| + |D|)

    Jaccard's Coefficient:  |Q ∩ D| / |Q ∪ D|

    Cosine Coefficient:     |Q ∩ D| / (|Q|^(1/2) * |D|^(1/2))

    Overlap Coefficient:    |Q ∩ D| / min(|Q|, |D|)
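Treating Q and D as plain term sets, the coefficients above can be computed directly (the example sets are hypothetical):

```python
import math

# Set-based similarity coefficients; Q and D are sets of terms.
def dice(Q, D):
    return 2 * len(Q & D) / (len(Q) + len(D))

def jaccard(Q, D):
    return len(Q & D) / len(Q | D)

def cosine_coeff(Q, D):
    return len(Q & D) / math.sqrt(len(Q) * len(D))

def overlap(Q, D):
    return len(Q & D) / min(len(Q), len(D))

Q = {"nova", "film", "role"}
D = {"film", "role", "diet", "fur"}
print(len(Q & D))               # 2  (simple matching)
print(round(dice(Q, D), 2))     # 0.57
print(round(jaccard(Q, D), 2))  # 0.4
print(round(overlap(Q, D), 2))  # 0.67
```

All four normalize the raw overlap |Q ∩ D| differently, which is why they rank borderline documents differently even on the same data.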
Problems with Vector Space

- There is no real theoretical basis for the assumption of a term space: it is more for visualization than having any real basis, and most similarity measures work about the same regardless of model.
- Terms are not really orthogonal dimensions: terms are not independent of all other terms.
Probabilistic Models

- A rigorous formal model attempts to predict the probability that a given document will be relevant to a given query.
- Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle).
- Relies on accurate estimates of probabilities for accurate results.
Probabilistic Retrieval

- Goes back to the 1960's (Maron and Kuhns).
- Robertson's "Probabilistic Ranking Principle": retrieved documents should be ranked in decreasing probability that they are relevant to the user's query.
- How to estimate these probabilities? Several methods (Model 1, Model 2, Model 3) with different emphases on how the estimates are done.
Probabilistic Models: Some Notation

- D = all present and future documents
- Q = all present and future queries
- (D_i, Q_j) = a document-query pair
- x = class of similar documents, x ⊆ D
- y = class of similar queries, y ⊆ Q

Relevance is a relation:

    R = {(D_i, Q_j) | D_i ∈ D, Q_j ∈ Q, document D_i is judged relevant by the user submitting query Q_j}
Probabilistic Models: Logistic Regression

The probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained by:

    P(R | Q, D_i) = c_0 + sum_{i=1}^{6} c_i * X_i

for the six X attribute measures shown next.
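A sketch of how such an estimate could be computed at retrieval time. The coefficients and attribute values below are invented for illustration; in the model described above the c_i come from fitting the regression on a judged sample, and the linear combination (the log-odds) is mapped through the logistic function so the result is a probability:

```python
import math

def relevance_probability(X, c):
    # c[0] is the intercept; c[1:] pairs with the attribute values X.
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], X))
    # Logistic function maps log-odds into [0, 1].
    return 1 / (1 + math.exp(-log_odds))

c = [-3.5, 0.4, -0.1, 0.3, -0.2, 0.5, 1.0]  # hypothetical coefficients
X = [1.2, 3.0, 2.1, 5.5, 2.8, 1.1]          # hypothetical attribute values

p = relevance_probability(X, c)
print(0.0 <= p <= 1.0)  # True
```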
Probabilistic Models: Logistic Regression attributes

    X_1 = (1/M) * sum_{j=1}^{M} log QAF_{t_j}    Average Absolute Query Frequency
    X_2 = sqrt(QL)                               Query Length
    X_3 = (1/M) * sum_{j=1}^{M} log DAF_{t_j}    Average Absolute Document Frequency
    X_4 = sqrt(DL)                               Document Length
    X_5 = (1/M) * sum_{j=1}^{M} log IDF_{t_j}    Average Inverse Document Frequency
          where IDF_t = N / n_t                  Inverse Document Frequency
    X_6 = log M                                  Number of terms in common between query and document -- logged
Probabilistic Models

Advantages:
- Strong theoretical basis.
- In principle should supply the best predictions of relevance given the available information.
- Can be implemented similarly to Vector.

Disadvantages:
- Relevance information is required -- or is "guestimated".
- Important indicators of relevance may not be terms -- though terms only are usually used.
- Optimally requires on-going collection of relevance information.
Vector and Probabilistic Models

Both:
- Support "natural language" queries.
- Treat documents and queries the same.
- Support relevance feedback searching.
- Support ranked retrieval.

They differ primarily in theoretical basis and in how the ranking is calculated:
- Vector assumes relevance.
- Probabilistic relies on relevance judgments or estimates.
Simple Presentation of Results

- Order by similarity: decreasing order of presumed relevance.
- Items retrieved early in the search may help generate feedback via relevance feedback.
- Select the top k documents, or select documents within ε of the query.
Evaluation

- Relevance
- Evaluation of IR systems
- Precision vs. recall
- Cutoff points
- Test collections / TREC
- Blair & Maron study
What to Evaluate?

- How much is learned about the collection?
- How much is learned about a topic?
- How much of the information need is satisfied?
- How inviting is the system?
What to Evaluate?

What can be measured that reflects users' ability to use the system? (Cleverdon 66)
- Coverage of information
- Form of presentation
- Effort required / ease of use
- Time and space efficiency
- Recall: proportion of relevant material actually retrieved
- Precision: proportion of retrieved material actually relevant

Recall and precision together measure effectiveness.
Relevance

In what ways can a document be relevant to a query?
- Answer a precise question precisely.
- Partially answer a question.
- Suggest a source for more information.
- Give background information.
- Remind the user of other knowledge.
- Others ...
Standard IR Evaluation

    Precision = (# relevant retrieved) / (# retrieved)

    Recall = (# relevant retrieved) / (# relevant in collection)

[Figure: the collection drawn as a large set containing the relevant documents and the retrieved documents as overlapping subsets.]
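The two ratios translate directly into code (the counts below are hypothetical):

```python
def precision(n_relevant_retrieved, n_retrieved):
    # Fraction of what was retrieved that is relevant.
    return n_relevant_retrieved / n_retrieved

def recall(n_relevant_retrieved, n_relevant_in_collection):
    # Fraction of all relevant documents that were retrieved.
    return n_relevant_retrieved / n_relevant_in_collection

# Say the system retrieves 20 documents, 8 of which are relevant,
# out of 40 relevant documents in the whole collection:
print(precision(8, 20))  # 0.4
print(recall(8, 40))     # 0.2
```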
Precision/Recall Curves

There is a tradeoff between precision and recall, so measure precision at different levels of recall.

[Figure: a precision vs. recall plot with measured points marked "x".]
Precision/Recall Curves

It is difficult to determine which of these two hypothetical results is better:

[Figure: two overlapping precision vs. recall curves, each with points marked "x"; one is higher at low recall, the other at high recall.]
Document Cutoff Levels

Another way to evaluate:
- Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500.
- Measure precision at each of these levels.
- Take the (weighted) average over results.

This is a way to focus on high precision.
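A minimal sketch of precision at fixed cutoffs, assuming a list of hypothetical relevance judgments for one ranked result list:

```python
# rels[i] is True if the i-th retrieved document is relevant
# (judgments below are invented for illustration).
rels = [True, False, True, True, False, True, False, False, True, False]

def precision_at(k, rels):
    # Precision over just the top k results.
    return sum(rels[:k]) / k

for k in (5, 10):
    print(k, precision_at(k, rels))  # 5 0.6, then 10 0.5
```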
The E-Measure

Combine precision and recall into one number (van Rijsbergen 79):

    E = 1 - (1 + b^2) * P * R / (b^2 * P + R)

    P = precision
    R = recall
    b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.
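The E-measure as a function (example values are illustrative):

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E: 0 is best, 1 is worst.
    b < 1 weights precision more heavily; b > 1 weights recall."""
    return 1 - ((1 + b**2) * precision * recall) / (b**2 * precision + recall)

# With b = 1 this is 1 minus the harmonic mean of P and R:
print(round(e_measure(0.5, 0.5), 2))  # 0.5
# Perfect precision and recall give the best score, 0:
print(e_measure(1.0, 1.0, b=0.5))  # 0.0
```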
TREC: Text REtrieval Conference/Competition

- Run by NIST (National Institute of Standards & Technology); 1997 was the 6th year.
- Collection: 3 gigabytes, > 1 million documents.
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments.
  - Queries devised and judged by "information specialists".
  - Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition.
  - Various research and commercial groups compete.
  - Results judged on precision and recall, going up to a recall level of 1000 documents.
Sample TREC queries (topics)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative:
A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
TREC

Benefits:
- Made research systems scale to large collections (pre-WWW).
- Allows for somewhat controlled comparisons.

Drawbacks:
- Emphasis on high recall, which may be unrealistic for what most users want.
- Very long queries, also unrealistic.
- Comparisons still difficult to make, because systems are quite different on many dimensions.
- Focus on batch ranking rather than interaction.
- No focus on the WWW.
TREC Results

- Results differ each year.
- For the main track:
  - The best systems are not statistically significantly different.
  - Small differences sometimes have big effects:
    - how good was the hyphenation model
    - how was document length taken into account
  - Systems were optimized for longer queries and all performed worse for shorter, more realistic queries.
- The excitement is in the new tracks:
  - Interactive
  - Multilingual
  - NLP
Blair and Maron 1985

- A highly influential paper: a classic study of retrieval effectiveness (earlier studies used unrealistically small collections).
- Studied an archive of documents for a legal suit:
  - ~350,000 pages of text
  - 40 queries
  - focus on high recall
- Used IBM's STAIRS full-text system.
- Main result: the system retrieved less than 20% of the relevant documents for a particular information need when the lawyers thought they had 75%.
- But many queries had very high precision.
Blair and Maron, cont.

Why recall was low:
- Users can't foresee the exact words and phrases that will indicate relevant documents:
  - "accident" referred to by those responsible as "event," "incident," "situation," "problem," ...
  - differing technical terminology
  - slang, misspellings
- Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied.