Transcript of cs707_020212
-
Vector Space Model: Ranking Revisited
Adapted from Lectures by
Prabhakar Raghavan (Yahoo and Stanford) and
Christopher Manning (Stanford)
-
Recap: tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight.
- Best known weighting scheme in information retrieval
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
- $w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$
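A minimal Python sketch of this weighting (not from the slides; the function name and arguments are illustrative):

```python
import math

def tfidf_weight(tf, df, N):
    """tf-idf weight: (1 + log10 tf_{t,d}) * log10(N / df_t).

    tf: raw count of term t in doc d; df: number of docs containing t;
    N: total number of docs. Returns 0 when the term is absent.
    """
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(N / df)

# tfidf_weight(tf=3, df=50, N=1_000_000) ~= 1.477 * 4.301 ~= 6.35
```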
-
Recap: Queries as vectors
- Key idea 1: Do the same for queries: represent them as vectors in the space
- Key idea 2: Rank documents according to their proximity to the query in this space
- proximity = similarity of vectors
-
Recap: cosine(query,document)
$$\cos(\vec q,\vec d) \;=\; \frac{\vec q \cdot \vec d}{|\vec q|\,|\vec d|} \;=\; \frac{\vec q}{|\vec q|}\cdot\frac{\vec d}{|\vec d|} \;=\; \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^{2}}\;\sqrt{\sum_{i=1}^{|V|} d_i^{2}}}$$

(dot product of unit vectors)

cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
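A minimal Python sketch of this computation over sparse term-weight vectors (not from the slides):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse vectors given as term -> weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

# cosine({"rain": 1.0, "gentle": 1.0}, {"rain": 0.5, "merchant": 2.0}) -> ~0.17
```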
-
This lecture
- Speeding up vector space ranking
- Putting together a complete search system
- Will require learning about a number of miscellaneous topics and heuristics
-
Computing cosine scores
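The scoring algorithm shown on this slide was a figure that did not survive extraction; below is a minimal Python sketch of term-at-a-time cosine scoring with per-document accumulators, in the spirit of the algorithm the later slides call Fig 7.1 (the data layout and names are assumptions):

```python
import heapq

def cosine_scores(query_terms, postings, doc_length, K):
    """Term-at-a-time cosine scoring with an accumulator per candidate doc.

    postings:   dict term -> list of (docID, w_td) pairs   (assumed layout)
    doc_length: dict docID -> Euclidean length of the doc vector
    Treats query term weights as 1 (the unweighted-query simplification
    used later in the lecture). Returns the K best (score, docID) pairs.
    """
    scores = {}
    for t in query_terms:
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]
    return heapq.nlargest(K, ((s, d) for d, s in scores.items()))
```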
-
Efficient cosine ranking
- Find the K docs in the collection nearest to the query ⇒ K largest query-doc cosines.
- Efficient ranking:
  - Computing a single cosine efficiently.
  - Choosing the K largest cosine values efficiently.
- Can we do this without computing all N cosines?
-
(contd.)
- What we're doing in effect: solving the K-nearest-neighbor problem for a query vector
- In general, we do not know how to do this efficiently for high-dimensional spaces
- But it is solvable for short queries, and standard indexes support this well
-
Special case: unweighted queries
- No weighting on query terms
- Assume each query term occurs only once
- Then for ranking, don't need to normalize the query vector
- Slight simplification
7.1
-
Faster cosine: unweighted query
(Algorithm figure from the original slide; annotation: is the "Read" step redundant?)
7.1
-
Computing the K largest cosines: selection vs. sorting
- Typically we want to retrieve the top K docs (in the cosine ranking for the query)
  - not to totally order all docs in the collection
- Can we pick off the docs with the K highest cosines?
- Let J = number of docs with nonzero cosines
- We seek the K best of these J
7.1
-
Use heap for selecting top K
- Binary tree in which each node's value > the values of its children
- Takes 2J operations to construct, then each of K winners read off in 2 log J steps.
- For J = 10M, K = 100, this is about 10% of the cost of sorting.
(Figure: a max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1)
7.1
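A minimal sketch of this selection step using Python's heapq (not from the slides):

```python
import heapq

def top_k_cosines(cosines, K):
    """Select the K largest of J cosine values without sorting all J.

    heapify costs O(J) (~2J comparisons); each of the K pops costs O(log J),
    matching the 2J + 2K log J figure on the slide. Python's heapq is a
    min-heap, so values are negated to simulate a max-heap.
    """
    heap = [-c for c in cosines]
    heapq.heapify(heap)
    return [-heapq.heappop(heap) for _ in range(min(K, len(heap)))]

# top_k_cosines([0.1, 0.9, 0.3, 0.8, 0.3, 0.1, 1.0], K=3) -> [1.0, 0.9, 0.8]
```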
-
Bottlenecks
- Primary computational bottleneck in scoring: cosine computation
- Can we avoid all this computation?
- Yes, but we may sometimes get it wrong
  - a doc not in the top K may creep into the list of K output docs
- Is this such a bad thing?
7.1.1
-
Cosine similarity is only a proxy
- User has a task and a query formulation
- Cosine matches docs to query
- Thus cosine is anyway a proxy for user happiness
- If we get a list of K docs close to the top K by the cosine measure, that should be OK
7.1.1
-
Generic approach
- Find a set A of contenders, with K < |A| << N
  - A does not necessarily contain the top K, but has many docs from among the top K
- Return the top K docs in A
-
Index elimination
- The basic algorithm of Fig 7.1 considers only docs containing at least one query term
- Take this further:
  - Consider only high-idf query terms
  - Consider only docs containing many query terms
7.1.2
-
High-idf query terms only
- For a query such as "catcher in the rye"
- Only accumulate scores from "catcher" and "rye"
- Intuition: "in" and "the" contribute little to the scores and don't alter rank-ordering much
- Benefit:
  - Postings of low-idf terms have many docs → these (many) docs get eliminated from A
7.1.2
-
Docs containing many query terms
- Any doc with at least one query term is a candidate for the top K output list
- For multi-term queries, only compute scores for docs containing several of the query terms
  - Say, at least 3 out of 4
- Imposes a "soft conjunction" on queries, as seen on web search engines (early Google)
- Easy to implement in postings traversal
7.1.2
-
3 of 4 query terms
Antony:    3  4  8  16  32  64  128
Brutus:    2  4  8  16  32  64  128
Caesar:    1  2  3  5  8  13  21  34
Calpurnia: 13 16 32

Scores only computed for docs 8, 16 and 32.
7.1.2
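A minimal sketch of this candidate-selection step on the postings above (layout assumed as term -> sorted docID list):

```python
from collections import Counter

def candidate_docs(query_terms, postings, min_terms=3):
    """Docs that contain at least min_terms of the query terms (soft conjunction)."""
    counts = Counter()
    for t in query_terms:
        counts.update(postings.get(t, []))
    return sorted(d for d, c in counts.items() if c >= min_terms)

index = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}
print(candidate_docs(["antony", "brutus", "caesar", "calpurnia"], index))  # [8, 16, 32]
```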
-
Champion lists
- Precompute, for each dictionary term t, the r docs of highest weight in t's postings
- Call this the champion list for t
  - (aka fancy list or top docs for t)
- Note that r has to be chosen at index time
- At query time, only compute scores for docs in the champion list of some query term
- Pick the K top-scoring docs from amongst these
7.1.3
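A minimal sketch of building and using champion lists (postings layout assumed as term -> list of (docID, weight) pairs):

```python
import heapq

def build_champion_lists(postings, r):
    """At index time, keep for each term only the r postings of highest weight.

    The weight might be tf-idf_{t,d}; r is fixed at index time, as the slide notes.
    """
    return {
        t: heapq.nlargest(r, plist, key=lambda entry: entry[1])
        for t, plist in postings.items()
    }

def champion_candidates(query_terms, champions):
    """At query time, score only docs appearing in some query term's champion list."""
    return {doc_id for t in query_terms for doc_id, _ in champions.get(t, [])}
```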
-
Exercises
- How do Champion Lists relate to Index Elimination? Can they be used together?
- How can Champion Lists be implemented in an inverted index?
- Note that the champion list has nothing to do with small docIDs
7.1.3
-
Quantitative
Static quality scores
- We want top-ranking documents to be both relevant and authoritative
- Relevance is being modeled by cosine scores
- Authority is typically a query-independent property of a document
- Examples of authority signals:
  - Wikipedia among websites
  - Articles in certain newspapers
  - A paper with many citations
  - Many diggs, Y! buzzes or del.icio.us marks
  - (PageRank)
7.1.4
-
Modeling authority
- Assign to each document d a query-independent quality score in [0,1]
- Denote this by g(d)
- Thus, a quantity like the number of citations is scaled into [0,1]
- Exercise: suggest a formula for this.
7.1.4
-
Net score
- Consider a simple total score combining cosine relevance and authority
- net-score(q,d) = g(d) + cosine(q,d)
- Can use some other linear combination than an equal weighting
- Indeed, any function of the two signals of user happiness (more later)
- Now we seek the top K docs by net score
7.1.4
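A one-line sketch of such a combination, with an assumed mixing weight lam (not from the slides):

```python
def net_score(g_d, cos_qd, lam=0.5):
    """Linear combination of static quality g(d) in [0,1] and cosine(q,d).

    lam = 0.5 reproduces the slide's equal weighting up to a constant factor;
    lam is an assumed tuning knob for trading authority against relevance.
    """
    return lam * g_d + (1.0 - lam) * cos_qd
```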
-
Top K by net score - fast methods
- First idea: order all postings by g(d)
- Key: this is a common ordering for all postings
  - Thus, can concurrently traverse query terms' postings for
    - Postings intersection
    - Cosine score computation
- Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)
7.1.4
-
Why order postings by g(d)?
- Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
- In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
- Short of computing scores for all docs in the postings
7.1.4
-
Champion lists in g(d)-ordering
- Can combine champion lists with g(d)-ordering
- Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
- Seek top-K results from only the docs in these champion lists
7.1.4
-
High and low lists
- For each term, we maintain two postings lists called high and low
  - Think of high as the champion list
- When traversing postings on a query, only traverse the high lists first
  - If we get more than K docs, select the top K and stop
  - Else proceed to get docs from the low lists
- Can be used even for simple cosine scores, without global quality g(d)
- A means for segmenting the index into two tiers
7.1.4
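A minimal sketch of the candidate-gathering step over the two tiers (postings layout assumed):

```python
def two_tier_candidates(query_terms, high, low, K):
    """Gather candidate docs from the 'high' tier first; fall back to 'low' if needed.

    high, low: dict term -> list of docIDs, the two postings lists per term
    (assumed layout). Only candidate gathering is sketched; scoring and top-K
    selection then proceed as before on the returned set.
    """
    docs = set()
    for t in query_terms:
        docs.update(high.get(t, []))
    if len(docs) < K:                      # not enough docs from the high lists
        for t in query_terms:
            docs.update(low.get(t, []))
    return docs
```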
-
Impact-ordered postings
- We only want to compute scores for docs for which wf_{t,d} is high enough
- We sort each postings list by wf_{t,d}
- Now: not all postings are in a common order!
  - How do we compute scores in order to pick off the top K?
- Two ideas follow
7.1.5
-
1. Early termination
- When traversing t's postings, stop early after either
  - a fixed number of docs, r, or
  - wf_{t,d} drops below some threshold
- Take the union of the resulting sets of docs
  - One from the postings of each query term
- Compute only the scores for docs in this union
7.1.5
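A minimal sketch of this union-building step (postings layout and parameter names are assumptions):

```python
def early_terminated_union(query_terms, impact_postings, r=None, threshold=None):
    """Union of per-term prefixes of impact-ordered postings.

    impact_postings: dict term -> list of (docID, wf_td) sorted by decreasing
    wf_td (assumed layout). For each term we stop after r postings or once
    wf_td drops below threshold; scores are then computed only for this union.
    """
    docs = set()
    for t in query_terms:
        for i, (doc_id, wf) in enumerate(impact_postings.get(t, [])):
            if r is not None and i >= r:
                break
            if threshold is not None and wf < threshold:
                break
            docs.add(doc_id)
    return docs
```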
-
2. idf-ordered terms
- When considering the postings of query terms
  - Look at them in order of decreasing idf
  - High-idf terms are likely to contribute most to the score
- As we update the score contribution from each query term
  - Stop if doc scores are relatively unchanged
- Can apply to cosine or some other net scores
7.1.5
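A rough sketch of one way to realize this early-stopping heuristic (the stopping test and epsilon are assumptions, not from the slides):

```python
def idf_ordered_scores(query_terms, idf, postings, epsilon=0.05):
    """Accumulate scores term by term in decreasing idf order, stopping early.

    postings: dict term -> list of (docID, wf_td) pairs (assumed layout).
    We stop once the largest contribution a term added falls below epsilon,
    i.e. once the remaining low-idf terms can barely change doc scores.
    """
    scores = {}
    for t in sorted(query_terms, key=lambda term: idf[term], reverse=True):
        max_contrib = 0.0
        for doc_id, wf in postings.get(t, []):
            contrib = idf[t] * wf
            scores[doc_id] = scores.get(doc_id, 0.0) + contrib
            max_contrib = max(max_contrib, contrib)
        if max_contrib < epsilon:
            break
    return scores
```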
-
Cluster pruning: preprocessing
- Pick √N docs at random: call these leaders
- For every other doc, pre-compute its nearest leader
- Docs attached to a leader: its followers
- Likely: each leader has ~√N followers.
7.1.6
-
Cluster pruning: query processing
- Process a query as follows:
  - Given query Q, find its nearest leader L.
  - Seek the K nearest docs from among L's followers.
7.1.6
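A compact sketch of both phases (preprocessing and query processing); the vector layout, function names, and use of random.sample are assumptions for the example:

```python
import math
import random

def _cos(u, v):
    """Cosine between two sparse term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_preprocess(docs):
    """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader.

    docs: dict docID -> sparse vector (assumed layout).
    """
    n_leaders = max(1, int(math.sqrt(len(docs))))
    leaders = random.sample(list(docs), n_leaders)
    followers = {ld: [] for ld in leaders}
    for doc_id, vec in docs.items():
        nearest = max(leaders, key=lambda ld: _cos(vec, docs[ld]))
        followers[nearest].append(doc_id)
    return followers

def cluster_pruned_search(q, docs, followers, K):
    """Find the query's nearest leader, then rank only that leader's followers."""
    leader = max(followers, key=lambda ld: _cos(q, docs[ld]))
    candidates = set(followers[leader]) | {leader}
    ranked = sorted(candidates, key=lambda d: _cos(q, docs[d]), reverse=True)
    return ranked[:K]
```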
-
Visualization
(Figure: a query point shown among leaders and their attached followers)
7.1.6
-
Why use random sampling?
- Fast
- Leaders reflect the data distribution
7.1.6
-
General variants
- Have each follower attached to b1 = 3 (say) nearest leaders.
- From the query, find b2 = 4 (say) nearest leaders and their followers.
- Can recur on the leader/follower construction.
7.1.6
-
Exercises
- To find the nearest leader in step 1, how many cosine computations do we do?
- Why did we have √N in the first place?
- What is the effect of the constants b1, b2 on the previous slide?
- Devise an example where this is likely to fail, i.e., we miss one of the K nearest docs.
  - Likely under random sampling.
7.1.6
-
Parametric and zone indexes
- Thus far, a doc has been a sequence of terms
- In fact, documents have multiple parts, some with special semantics:
  - Author
  - Title
  - Date of publication
  - Language
  - Format
  - etc.
- These constitute the metadata about a document
6.1
-
Fields
- We sometimes wish to search by these metadata
  - E.g., find docs authored by William Shakespeare in the year 1601, containing "alas poor Yorick"
- Year = 1601 is an example of a field
- Also, author last name = shakespeare, etc.
- Field or parametric index: postings for each field value
  - Sometimes build range trees (e.g., for dates)
- A field query is typically treated as a conjunction
  - (doc must be authored by shakespeare)
6.1
-
Zone
- A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
  - Title
  - Abstract
  - References
- Build inverted indexes on zones as well, to permit querying
- E.g., "find docs with merchant in the title zone and matching the query gentle rain"
6.1
-
Example zone indexes
6.1
Encode zones in dictionary vs. postings.
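The index figure itself did not survive extraction; the two encodings it contrasts can be sketched as follows (the docIDs are hypothetical):

```python
# Option 1: zones encoded in the dictionary - one postings list per (term, zone) pair.
zones_in_dictionary = {
    ("merchant", "title"):  [2, 4, 7],
    ("merchant", "body"):   [1, 2, 3, 5],
    ("merchant", "author"): [4],
}

# Option 2: zones encoded in the postings - one list per term, each entry
# tagged with the zones in which the term occurs in that doc.
zones_in_postings = {
    "merchant": [(1, ["body"]), (2, ["title", "body"]),
                 (3, ["body"]), (4, ["title", "author"]),
                 (5, ["body"]), (7, ["title"])],
}
```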
-
Tiered indexes
- Break postings up into a hierarchy of lists
  - Most important
  - ...
  - Least important
- Can be done by g(d) or another measure
- The inverted index is thus broken up into tiers of decreasing importance
- At query time, use the top tier unless it fails to yield K docs
  - If so, drop to lower tiers
7.2.1
-
Example tiered index
7.2.1
-
Query term proximity
- Free text queries: just a set of terms typed into the query box - common on the web
- Users prefer docs in which query terms occur within close proximity of each other
- Let w be the smallest window in a doc containing all query terms, e.g.,
  - For the query "strained mercy", the smallest window in the doc "The quality of mercy is not strained" is 4 (words)
- Would like the scoring function to take this into account - how?
7.2.2
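A minimal sketch of computing w for a tokenized doc (an O(n^2) illustration, not a production algorithm, which would use positional postings):

```python
def smallest_window(doc_tokens, query_terms):
    """Smallest window (in words) of doc_tokens containing every query term, or None."""
    needed = set(query_terms)
    best = None
    for i, tok in enumerate(doc_tokens):
        if tok not in needed:
            continue
        seen = set()
        for j in range(i, len(doc_tokens)):
            if doc_tokens[j] in needed:
                seen.add(doc_tokens[j])
                if seen == needed:
                    width = j - i + 1
                    best = width if best is None else min(best, width)
                    break
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))  # 4, as in the slide's example
```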
-
Query parsers
- A free text query from the user may in fact spawn one or more queries to the indexes, e.g., the query "rising interest rates":
  - Run the query as a phrase query
  - If <K docs contain the phrase "rising interest rates", run the two phrase queries "rising interest" and "interest rates"
  - If we still have <K docs, run the vector space query "rising interest rates"
  - Rank the matching docs by vector space scoring
-
Aggregate scores
- We've seen that score functions can combine cosine, static quality, proximity, etc.
- How do we know the best combination?
  - Some applications: expert-tuned
  - Increasingly common: machine-learned
7.2.3
-
Putting it all together
7.2.4
-
Evaluation of IR Systems
Adapted from Lectures by
Prabhakar Raghavan (Yahoo and Stanford) and
Christopher Manning (Stanford)
-
This lecture
- Results summaries:
  - Making our good results usable to a user
- How do we know if our results are any good?
  - Evaluating a search engine
  - Benchmarks
  - Precision and recall
-
Result Summaries
- Having ranked the documents matching a query, we wish to present a results list.
- Most commonly, a list of the document titles plus a short summary, aka "10 blue links".
-
Summaries
- The title is typically automatically extracted from document metadata. What about the summaries?
  - This description is crucial.
  - Users can identify good/relevant hits based on the description.
- Two basic kinds:
  - A static summary of a document is always the same, regardless of the query that hit the doc.
  - A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand.
-
Static summaries
- In typical systems, the static summary is a subset of the document.
- Simplest heuristic: the first 50 (or so - this can be varied) words of the document
  - Summary cached at indexing time
- More sophisticated: extract from each document a set of key sentences
  - Simple NLP heuristics to score each sentence
  - Summary is made up of the top-scoring sentences.
- Most sophisticated: NLP used to synthesize a summary
  - Seldom used in IR (cf. text summarization work)
-
Dynamic summaries
- Present one or more "windows" within the document that contain several of the query terms
  - "KWIC" snippets: Keyword-in-Context presentation
- Generated in conjunction with scoring
  - If the query is found as a phrase, all or some occurrences of the phrase in the doc
  - If not, document windows that contain multiple query terms
- The summary itself gives the entire content of the window - all terms, not only the query terms.
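A minimal sketch of cutting such a window out of a cached document, cueing from hit positions (the names and the width budget are assumptions):

```python
def kwic_snippet(doc_tokens, hit_positions, width=10):
    """Return a window of doc_tokens around the first hit, keyword-in-context style.

    hit_positions: positions of query-term occurrences in the doc (e.g. from a
    positional index). width is an assumed snippet budget in words.
    """
    if not hit_positions:
        return []
    start = max(0, hit_positions[0] - width // 2)
    return doc_tokens[start : start + width]

doc = "the quality of mercy is not strained it droppeth as the gentle rain".split()
print(" ".join(kwic_snippet(doc, hit_positions=[6])))
# -> "quality of mercy is not strained it droppeth as the"
```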
-
Generating dynamic summaries
- If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
- If we cache the documents at index time, then we can find windows in them, cueing from hits found in the positional index.
  - E.g., the positional index says the query is a phrase at position 4378, so we go to this position in the cached document and stream out the content
- Most often, cache only a fixed-size prefix of the doc.
  - Note: the cached copy can be outdated
-
Dynamic summaries
- Producing good dynamic summaries is a tricky optimization problem
  - The real estate for the summary is normally small and fixed
  - Want snippets to be long enough to be useful
  - Want linguistically well-formed snippets
  - Want snippets maximally informative about the doc
- But users really like snippets, even if they complicate IR system design
-
Alternative results presentations?
- An active area of HCI research
- An alternative: http://www.searchme.com/ copies the idea of Apple's Cover Flow for search results
-
Evaluating search engines
-
Measures for a search engine
- How fast does it index?
  - Number of documents/hour
  - (Average document size)
- How fast does it search?
  - Latency as a function of index size
- Expressiveness of the query language
  - Ability to express complex information needs
  - Speed on complex queries
- Uncluttered UI
- Is it free?
-
Measures for a search engine
- All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
- The key measure: user happiness
  - What is this?
  - Speed of response / size of index are factors
  - But blindingly fast, useless answers won't make a user happy
- Need a way of quantifying user happiness
-
Data Retrieval vs Information Retrieval
- DR Performance Evaluation (after establishing correctness)
  - Response time
  - Index space
- IR Performance Evaluation
  - How relevant is the answer set? (required to establish functional correctness, e.g., through benchmarks)
-
Measuring user happiness
- Issue: who is the user we are trying to make happy?
- Depends on the setting/context
  - Web engine: the user finds what they want and returns to the engine
    - Can measure the rate of return users
  - eCommerce site: the user finds what they want and makes a purchase
    - Is it the end-user, or the eCommerce site, whose happiness we measure?
    - Measure time to purchase, or the fraction of searchers who become buyers?
-
Measuring user happiness
- Enterprise (company/govt/academic): care about user productivity
  - How much time do my users save when looking for information?
- Many other criteria having to do with breadth of access, secure access, etc.
-
Happiness: elusive to measure
- Most common proxy: relevance of search results
- But how do you measure relevance?
- We will detail a methodology here, then examine its issues
- Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
  - Some work on more-than-binary assessments, but that is not the standard
-
Evaluating an IR system
- Note: the information need is translated into a query
- Relevance is assessed relative to the information need, not the query
  - E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine.
  - Query: wine red white heart attack effective
- You evaluate whether the doc addresses the information need, not whether it has these words
-
Difficulties with gauging Relevancy
- Relevancy, from a human standpoint, is:
  - Subjective: depends upon a specific user's judgment.
  - Situational: relates to the user's current needs.
  - Cognitive: depends on human perception and behavior.
  - Dynamic: changes over time.
-
Standard relevance benchmarks
- TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
- Reuters and other benchmark doc collections are also used
- Retrieval tasks specified
  - sometimes as queries
- Human experts mark, for each query and for each doc, Relevant or Nonrelevant
  - or at least for the subset of docs that some system returned for that query
-
Unranked retrieval evaluation: Precision and Recall
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
- Precision P = tp/(tp + fp)
- Recall R = tp/(tp + fn)

                | Relevant | Nonrelevant
  Retrieved     | tp       | fp
  Not Retrieved | fn       | tn
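A minimal sketch of computing both measures from sets of docIDs:

```python
def precision_recall(retrieved, relevant):
    """Precision = tp/(tp+fp), Recall = tp/(tp+fn), computed from docID sets."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. 3 of 5 retrieved docs are relevant, out of 6 relevant docs in total:
print(precision_recall({1, 2, 3, 4, 5}, {2, 4, 5, 7, 8, 9}))  # (0.6, 0.5)
```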