Transcript of cs707_013112
-
8/3/2019 cs707_013112
1/37
Prasad L08VSM-tfidf 1
Vector Space Model: TF-IDF
Adapted from lectures by
Prabhakar Raghavan (Yahoo and Stanford) and
Christopher Manning (Stanford)
-
Recap of last lecture
- Collection and vocabulary statistics
- Heaps' and Zipf's laws
- Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
- Postings compression: gap encoding using prefix-unique codes
- Variable-Byte and Gamma codes
-
This lecture; Sections 6.2-6.4.3
- Scoring documents
- Term frequency
- Collection statistics
- Weighting schemes
- Vector space scoring
-
Ranked retrieval
- Thus far, our queries have all been Boolean: documents either match or don't.
- Good for expert users with a precise understanding of their needs and of the collection (e.g., library search).
- Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users:
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  - Most users don't want to wade through 1000s of results (e.g., web search).
-
Problem with Boolean search: feast or famine
- Boolean queries often result in either too few (= 0) or too many (1000s) results.
- Query 1: "standard user dlink 650" → 200,000 hits
- Query 2: "standard user dlink 650 no card found" → 0 hits
- It takes skill to come up with a query that produces a manageable number of hits.
- With a ranked list of documents, it does not matter how large the retrieved set is.
-
Scoring as the basis of ranked retrieval
- We wish to return, in order, the documents most likely to be useful to the searcher.
- How can we rank-order the documents in the collection with respect to a query?
- Assign a score, say in [0, 1], to each document.
- This score measures how well document and query match.
-
Query-document matching scores
- We need a way of assigning a score to a query/document pair.
- Let's start with a one-term query.
- If the query term does not occur in the document, the score should be 0.
- The more frequent the query term in the document, the higher the score should be.
- We will look at a number of alternatives for this.
-
Take 1: Jaccard coefficient
- Recall: the Jaccard coefficient is a commonly used measure of the overlap of two sets A and B:
  jaccard(A, B) = |A ∩ B| / |A ∪ B|
  jaccard(A, A) = 1
  jaccard(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- The Jaccard coefficient always assigns a number between 0 and 1.
-
Jaccard coefficient: scoring example
- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
- Query: ides of march
- Document 1: caesar died in march
- Document 2: the long march
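The two scores can be checked with a short sketch (plain Python, whitespace tokenization):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; 0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6: only "march" is shared, union has 6 terms
print(jaccard(query, doc2))  # 1/5
```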
-
Issues with Jaccard for scoring
- It doesn't consider term frequency (how many times a term occurs in a document).
- It doesn't consider document/collection frequency (rare terms in a collection are more informative than frequent terms).
- We need a more sophisticated way of normalizing for length.
- Later in this lecture, we'll use |A ∩ B| / √(|A ∪ B|) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.
-
Recall (Lecture 1): Binary term-document incidence matrix

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|.
-
Term-document count matrices
- Consider the number of occurrences of a term in a document:
- Each document is a count vector in ℕ^|V|: a column below.

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     157                   73             0            0       0        0
Brutus     4                     157            0            1       0        0
Caesar     232                   227            0            2       1        1
Calpurnia  0                     10             0            0       0        0
Cleopatra  57                    0              0            0       0        0
mercy      2                     0              3            5       5        1
worser     2                     0              1            1       1        0
-
Bag of words model
- Vector representation doesn't consider the ordering of words in a document.
- "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
- This is called the bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at recovering positional information later in this course.
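The two example sentences can be checked with a minimal sketch using collections.Counter as the count vector:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count term occurrences, ignoring word order (bag of words)."""
    return Counter(text.lower().split())

d1 = bag_of_words("John is quicker than Mary")
d2 = bag_of_words("Mary is quicker than John")
# Both sentences yield the same count vector, so the bag of words
# model cannot distinguish them.
print(d1 == d2)  # True
```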
-
Term frequency tf
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term may be more relevant than a document with one occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.
-
Log-frequency weighting
- The log-frequency weight of term t in d is

  w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
  w_{t,d} = 0                     otherwise

- tf 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:

  score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

- The score is 0 if none of the query terms is present in the document.
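The piecewise weight above, as a minimal sketch:

```python
import math

def log_tf_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# 0 -> 0, 1 -> 1, 2 -> ~1.3, 10 -> 2, 1000 -> 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```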
-
Document frequency
- Rare terms are more informative than frequent terms.
- Recall stop words.
- Consider a term in the query that is rare in the collection (e.g., arachnocentric).
- A document containing this term is very likely to be relevant to the query arachnocentric.
- We want a higher weight for rare terms like arachnocentric.
-
Document frequency, continued
- Consider a query term that is frequent in the collection (e.g., high, increase, line).
- A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
- For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
- We will use document frequency (df) to capture this in the score.
- df (≤ N) is the number of documents that contain the term.
-
idf weight
- df_t is the document frequency of t: the number of documents that contain t.
- df_t is an inverse measure of the informativeness of t.
- We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

- We use log(N/df_t) instead of N/df_t to dampen the effect of idf.
- It will turn out that the base of the log is immaterial.
-
idf example, suppose N = 1 million

term       df_t       idf_t
calpurnia  1          6
animal     100        4
sunday     1,000      3
fly        10,000     2
under      100,000    1
the        1,000,000  0

There is one idf value for each term t in a collection.
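The table values follow directly from the definition; a quick sketch:

```python
import math

def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)

N = 1_000_000
for term, df_t in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                   ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(df_t, N))  # matches the idf_t column above
```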
-
Collection vs. document frequency
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Which word is a better search term (and should get a higher weight)?

Word       Collection frequency  Document frequency
insurance  10440                 3997
try        10422                 8760
-
tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

- Best known weighting scheme in information retrieval.
- Note: the "-" in tf-idf is a hyphen, not a minus sign!
- Alternative names: tf.idf, tf × idf.
- Increases with the number of occurrences within a document.
- Increases with the rarity of the term in the collection.
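The product above, as a sketch (the function name is illustrative):

```python
import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# e.g., a term occurring 10 times in a document, present in 1,000 of 1,000,000 docs:
print(tfidf_weight(10, 1_000, 1_000_000))  # (1 + 1) * 3 = 6.0
```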
-
Binary → count → weight matrix

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     5.25                  3.18           0            0       0        0.35
Brutus     1.21                  6.1            0            1       0        0
Caesar     8.59                  2.54           0            1.51    0.25     0
Calpurnia  0                     1.54           0            0       0        0
Cleopatra  2.85                  0              0            0       0        0
mercy      1.51                  0              1.9          0.12    5.25     0.88
worser     1.37                  0              0.11         4.15    0.25     1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
-
Documents as vectors
- So we have a |V|-dimensional vector space.
- Terms are axes of the space.
- Documents are points or vectors in this space.
- Very high-dimensional: hundreds of millions of dimensions when you apply this to a web search engine.
- These are very sparse vectors: most entries are zero.
-
Queries as vectors
- Key idea 1: do the same for queries: represent them as vectors in the space.
- Key idea 2: rank documents according to their proximity to the query in this space.
- proximity = similarity of vectors
- proximity ≈ inverse of distance
- Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
- Instead: rank more relevant documents higher than less relevant documents.
-
Formalizing vector space proximity
- First cut: distance between two points (= distance between the end points of the two vectors).
- Euclidean distance?
- Euclidean distance is a bad idea...
- ...because Euclidean distance is large for vectors of different lengths.
-
Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
-
Use angle instead of distance
- Thought experiment: take a document d and append it to itself. Call this document d′.
- Semantically, d and d′ have the same content.
- The Euclidean distance between the two documents can be quite large.
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to their angle with the query.
-
From angles to cosines
- The following two notions are equivalent:
  - Rank documents in increasing order of the angle between query and document.
  - Rank documents in decreasing order of cosine(query, document).
- Cosine is a monotonically decreasing function on the interval [0°, 180°].
-
Length normalization
- A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:

  ||x||_2 = √(Σ_i x_i²)

- Dividing a vector by its L2 norm makes it a unit (length) vector.
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
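A minimal sketch of L2 normalization; note that appending d to itself doubles every term count, so the normalized vectors coincide:

```python
import math

def l2_normalize(x):
    """Divide each component by the vector's L2 norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

d = [3.0, 4.0]
d_doubled = [2 * v for v in d]   # d appended to itself: every count doubles
print(l2_normalize(d))           # [0.6, 0.8]
print(l2_normalize(d) == l2_normalize(d_doubled))  # True
```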
-
cosine(query, document)

  cos(q, d) = (q · d) / (|q| |d|) = (q / |q|) · (d / |d|)
            = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) √(Σ_{i=1}^{|V|} d_i²) )

i.e., the dot product of unit vectors.

q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
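As a sketch:

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = dot(q, d) / (||q|| * ||d||)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```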
-
Cosine similarity amongst 3 documents

How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

term       SaS  PaP  WH
affection  115  58   20
jealous    10   7    11
gossip     2    0    6
wuthering  0    0    38
-
3 documents example contd.

Log frequency weighting:

term       SaS   PaP   WH
affection  3.06  2.76  2.30
jealous    2.00  1.85  2.04
gossip     1.30  0     1.78
wuthering  0     0     2.58

After length normalization:

term       SaS    PaP    WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335  0      0.405
wuthering  0      0      0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
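The whole computation on this slide can be reproduced in a few lines (a sketch; term order is affection, jealous, gossip, wuthering, and the already-normalized vectors make the cosine a plain dot product):

```python
import math

def log_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Raw term counts per novel (terms: affection, jealous, gossip, wuthering)
counts = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}
vecs = {name: normalize([log_weight(tf) for tf in v]) for name, v in counts.items()}

print(round(dot(vecs["SaS"], vecs["PaP"]), 2))  # 0.94
print(round(dot(vecs["SaS"], vecs["WH"]), 2))   # 0.79
print(round(dot(vecs["PaP"], vecs["WH"]), 2))   # 0.69
```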
-
Computing cosine scores
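The slide's pseudocode (the term-at-a-time CosineScore algorithm of IIR, cf. Figure 6.14) did not survive transcription; the following is a sketch under assumed data structures, a postings dict of (doc_id, weight) pairs and precomputed document lengths:

```python
import heapq

def cosine_scores(query_weights, postings, doc_length, k=10):
    """Term-at-a-time cosine scoring.

    query_weights: dict term -> weight of the term in the query
    postings:      dict term -> list of (doc_id, weight-in-doc) pairs
    doc_length:    dict doc_id -> Euclidean length of the document vector
    Returns the top-k (score, doc_id) pairs, best first.
    """
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:                 # length-normalize the accumulated dot products
        scores[doc_id] /= doc_length[doc_id]
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

# Tiny illustrative index (all names and numbers are made up):
postings = {"car": [(1, 2.0), (2, 1.0)], "insurance": [(1, 3.0)]}
doc_length = {1: 2.0, 2: 1.0}
print(cosine_scores({"car": 1.0, "insurance": 1.0}, postings, doc_length))
# -> [(2.5, 1), (1.0, 2)]
```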
-
tf-idf weighting has many variants
Columns headed "n" are acronyms for weight schemes. [The table of tf-idf variants was a figure and is not reproduced here.]
Why is the base of the log in idf immaterial?
-
Weighting may differ in queries vs documents
- Many search engines allow for different weightings for queries vs documents.
- To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table.
- Example: ltn.lnc means:
  - Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.
  - Document: logarithmic tf, no idf, cosine normalization.
- Is this a bad idea?
-
tf-idf example: ltn.lnc

Query: best car insurance. Document: car insurance auto insurance.

           Query                           Document                      Prod
Term       tf-raw  tf-wt  df      idf  wt  tf-raw  tf-wt  wt   n'lized
auto       0       0      5000    2.3  0   1       1      1    0.52     0
best       1       1      50000   1.3  1.3 0       0      0    0        0
car        1       1      10000   2.0  2.0 1       1      1    0.52     1.04
insurance  1       1      1000    3.0  3.0 2       1.3    1.3  0.68     2.04

Score = 0 + 0 + 1.04 + 2.04 ≈ 3.08

Exercise: what is N, the number of docs?
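The score in the table can be checked with a sketch that takes the idf values straight from the idf column (so N itself stays an exercise; small differences from the table come from rounding):

```python
import math

def log_tf(tf):
    """Logarithmic tf component: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Terms in order: auto, best, car, insurance
idfs  = [2.3, 1.3, 2.0, 3.0]   # idf column from the slide
q_tfs = [0, 1, 1, 1]           # query: best car insurance
d_tfs = [1, 0, 1, 2]           # document: car insurance auto insurance

# Query side 'ltn': log tf * idf, no normalization
q = [log_tf(tf) * idf for tf, idf in zip(q_tfs, idfs)]

# Document side 'lnc': log tf, no idf, cosine normalization
d_raw = [log_tf(tf) for tf in d_tfs]
norm = math.sqrt(sum(x * x for x in d_raw))
d = [x / norm for x in d_raw]

score = sum(qi * di for qi, di in zip(q, d))
print(round(score, 2))  # 3.07
```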
-
Summary: vector space ranking
- Represent the query as a weighted tf-idf vector.
- Represent each document as a weighted tf-idf vector.
- Compute the cosine similarity score for the query vector and each document vector.
- Rank documents with respect to the query by score.
- Return the top K (e.g., K = 10) to the user.