Transcript of cs707_020212

  • Slide 1/67

Vector Space Model:

    Ranking Revisited

    Adapted from Lectures by

    Prabhakar Raghavan (Yahoo and Stanford) and

    Christopher Manning (Stanford)

  • Slide 2/67

    Recap: tf-idf weighting

    - The tf-idf weight of a term is the product of its tf weight and its idf weight:

      $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$

    - Best known weighting scheme in information retrieval
    - Increases with the number of occurrences within a document
    - Increases with the rarity of the term in the collection
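
    A minimal Python sketch of this weight, assuming the raw term frequency tf, document frequency df, and collection size N are already available (the function name and arguments are illustrative, not from the slides):

    ```python
    import math

    def tfidf_weight(tf: int, df: int, N: int) -> float:
        """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    # Example: a term occurring 3 times in a doc, appearing in 100 of 1,000,000 docs.
    print(tfidf_weight(tf=3, df=100, N=1_000_000))  # (1 + log10 3) * 4 ≈ 5.91
    ```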

  • Slide 3/67

    Recap: Queries as vectors

    - Key idea 1: Do the same for queries: represent them as vectors in the space
    - Key idea 2: Rank documents according to their proximity to the query in this space
      - proximity = similarity of vectors

  • Slide 4/67

    Recap: cosine(query, document)

    $\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$

    (the numerator is the dot product; q/|q| and d/|d| are unit vectors)

    - cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
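
    As a companion to the formula, a small Python sketch of cosine similarity over sparse term-weight vectors represented as dicts (term -> weight); the representation is an assumption for illustration:

    ```python
    import math

    def cosine(q: dict, d: dict) -> float:
        """Cosine similarity of two sparse tf-idf vectors stored as dicts."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        if norm_q == 0 or norm_d == 0:
            return 0.0
        return dot / (norm_q * norm_d)

    print(cosine({"gentle": 1.0, "rain": 1.0},
                 {"gentle": 0.5, "rain": 0.2, "merchant": 0.8}))
    ```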

  • Slide 5/67

    This lecture

    - Speeding up vector space ranking
    - Putting together a complete search system
      - Will require learning about a number of miscellaneous topics and heuristics

  • Slide 6/67

    Computing cosine scores
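
    A minimal Python sketch of one way to compute cosine scores term-at-a-time with accumulators and then pick the top K with a heap; the `postings` and `length` structures and their shapes are assumptions for illustration, not the slide's exact pseudocode:

    ```python
    import heapq

    def cosine_score(query_weights, postings, length, K):
        """Term-at-a-time cosine scoring with accumulators, returning the top K docs.

        query_weights: dict term -> w_{t,q}
        postings:      dict term -> list of (docID, wf_{t,d}) pairs
        length:        dict docID -> document vector length (for normalization)
        """
        scores = {}  # one accumulator per doc touched
        for t, w_tq in query_weights.items():
            for doc_id, wf_td in postings.get(t, []):
                scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * wf_td
        for doc_id in scores:
            scores[doc_id] /= length[doc_id]
        return heapq.nlargest(K, scores.items(), key=lambda item: item[1])
    ```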

  • Slide 7/67

    Efficient cosine ranking

    - Find the K docs in the collection nearest to the query
      - i.e., the K largest query-doc cosines
    - Efficient ranking:
      - Computing a single cosine efficiently
      - Choosing the K largest cosine values efficiently
        - Can we do this without computing all N cosines?

  • Slide 8/67

    (cont'd)

    - What we're doing in effect: solving the K-nearest-neighbor problem for a query vector
    - In general, we do not know how to do this efficiently for high-dimensional spaces
    - But it is solvable for short queries, and standard indexes support this well

  • Slide 9/67

    Special case: unweighted queries

    - No weighting on query terms
      - Assume each query term occurs only once
    - Then for ranking, don't need to normalize the query vector
      - Slight simplification

    7.1

  • Slide 10/67

    Faster cosine: unweighted query

    (annotation on the algorithm figure: "Read redundant?")

    7.1

  • Slide 11/67

    Computing the K largest cosines: selection vs. sorting

    - Typically we want to retrieve the top K docs (in the cosine ranking for the query)
      - not to totally order all docs in the collection
    - Can we pick off docs with the K highest cosines?
    - Let J = number of docs with nonzero cosines
      - We seek the K best of these J

    7.1

  • Slide 12/67

    Use heap for selecting top K

    - Binary tree in which each node's value > the values of its children
    - Takes 2J operations to construct, then each of the K winners is read off in 2 log J steps
    - For J = 10M, K = 100, this is about 10% of the cost of sorting

    (figure: example max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1)

    7.1
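
    In Python, `heapq.nlargest` gives the same select-without-full-sort effect (internally it maintains a size-K heap rather than the exact construction described above); a small sketch with made-up cosine scores:

    ```python
    import heapq
    import random

    # J = 10,000 (doc, cosine) pairs with nonzero scores -- illustrative data only.
    random.seed(0)
    cosines = {doc_id: random.random() for doc_id in range(10_000)}

    K = 100
    # Selects the K largest values without sorting all J of them.
    top_k = heapq.nlargest(K, cosines.items(), key=lambda item: item[1])
    print(top_k[:3])
    ```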

  • Slide 13/67

    Bottlenecks

    - Primary computational bottleneck in scoring: cosine computation
    - Can we avoid all this computation?
    - Yes, but may sometimes get it wrong
      - a doc not in the top K may creep into the list of K output docs
      - Is this such a bad thing?

    7.1.1

  • Slide 14/67

    Cosine similarity is only a proxy

    - User has a task and a query formulation
    - Cosine matches docs to query
    - Thus cosine is anyway a proxy for user happiness
    - If we get a list of K docs close to the top K by cosine measure, it should be OK

    7.1.1

  • Slide 15/67

    Generic approach

    - Find a set A of contenders, with K < |A| << N
      - A does not necessarily contain the top K, but has many docs from among the top K
    - Return the top K docs in A

  • Slide 16/67

    Index elimination

    - Basic algorithm of Fig 7.1 considers only docs containing at least one query term
    - Take this further:
      - Consider only high-idf query terms
      - Consider only docs containing many query terms

    7.1.2

  • Slide 17/67

    High-idf query terms only

    - For a query such as "catcher in the rye"
      - Only accumulate scores from "catcher" and "rye"
    - Intuition: "in" and "the" contribute little to the scores and don't alter rank-ordering much
    - Benefit:
      - Postings of low-idf terms have many docs; these (many) docs get eliminated from A

    7.1.2

  • Slide 18/67

    Docs containing many query terms

    - Any doc with at least one query term is a candidate for the top K output list
    - For multi-term queries, only compute scores for docs containing several of the query terms
      - Say, at least 3 out of 4
      - Imposes a "soft conjunction" on queries seen on web search engines (early Google)
    - Easy to implement in postings traversal

    7.1.2

  • Slide 19/67

    3 of 4 query terms

    Antony:    3 4 8 16 32 64 128
    Brutus:    2 4 8 16 32 64 128
    Caesar:    1 2 3 5 8 13 21 34
    Calpurnia: 13 16 32

    Scores only computed for docs 8, 16 and 32 (a sketch of this filter follows).

    7.1.2
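
    A minimal sketch of the "at least 3 of 4 query terms" filter applied to the postings above; only the surviving docIDs would then get cosine scores:

    ```python
    from collections import Counter

    postings = {
        "antony":    [3, 4, 8, 16, 32, 64, 128],
        "brutus":    [2, 4, 8, 16, 32, 64, 128],
        "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
        "calpurnia": [13, 16, 32],
    }

    def candidates(postings, min_terms):
        """Docs appearing in at least min_terms of the query terms' postings lists."""
        counts = Counter(doc for plist in postings.values() for doc in plist)
        return sorted(doc for doc, c in counts.items() if c >= min_terms)

    print(candidates(postings, min_terms=3))  # [8, 16, 32] -- matches the slide
    ```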

  • Slide 20/67

    Champion lists

    - Precompute for each dictionary term t, the r docs of highest weight in t's postings
      - Call this the champion list for t
      - (aka fancy list or top docs for t)
    - Note that r has to be chosen at index time
    - At query time, only compute scores for docs in the champion list of some query term
      - Pick the K top-scoring docs from amongst these

    7.1.3
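
    A minimal sketch of building champion lists at index time and gathering query-time candidates, assuming `postings[t]` is a list of (docID, weight) pairs (an illustrative layout, not prescribed by the slides):

    ```python
    import heapq

    def build_champion_lists(postings, r):
        """For each term, keep only the r postings of highest weight (its champion list)."""
        return {
            t: heapq.nlargest(r, plist, key=lambda entry: entry[1])
            for t, plist in postings.items()
        }

    def candidate_docs(champion_lists, query_terms):
        """At query time, score only docs appearing in some query term's champion list."""
        return {doc for t in query_terms for doc, _ in champion_lists.get(t, [])}
    ```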

  • Slide 21/67

    Exercises

    - How do Champion Lists relate to Index Elimination? Can they be used together?
    - How can Champion Lists be implemented in an inverted index?
      - Note that the champion list has nothing to do with small docIDs

    7.1.3

  • Slide 22/67

    Static quality scores

    - We want top-ranking documents to be both relevant and authoritative
    - Relevance is being modeled by cosine scores
    - Authority is typically a query-independent property of a document
    - Examples of authority signals:
      - Wikipedia among websites
      - Articles in certain newspapers
      - A paper with many citations
      - Many diggs, Y! buzzes or del.icio.us marks
      - (PageRank)
      (Quantitative)

    7.1.4

  • Slide 23/67

    Modeling authority

    - Assign to each document d a query-independent quality score in [0,1]
      - Denote this by g(d)
    - Thus, a quantity like the number of citations is scaled into [0,1]
      - Exercise: suggest a formula for this (one possibility is sketched below).

    7.1.4
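
    One possible answer to the exercise; this is an assumption for illustration, not the slides' prescribed formula. A saturating transform keeps the score in [0,1):

    ```python
    def g(citations: int, c: float = 100.0) -> float:
        """Map a nonnegative citation count into [0,1); more citations -> closer to 1.

        c is a tuning constant: a doc with c citations gets score 0.5.
        """
        return citations / (citations + c)

    print(g(0), g(100), g(10_000))  # 0.0, 0.5, ~0.99
    ```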

  • Slide 24/67

    Net score

    - Consider a simple total score combining cosine relevance and authority
      - net-score(q,d) = g(d) + cosine(q,d)
    - Can use some other linear combination than an equal weighting
      - Indeed, any function of the two signals of user happiness - more later
    - Now we seek the top K docs by net score

    7.1.4
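
    The combination itself is one line; a hedged sketch with a tunable weight in place of the equal weighting (the parameter name is illustrative):

    ```python
    def net_score(g_d: float, cos_qd: float, alpha: float = 0.5) -> float:
        """Linear combination of static quality g(d) and cosine(q,d).

        alpha = 0.5 reproduces the equal weighting g(d) + cosine(q,d)
        up to a constant factor.
        """
        return alpha * g_d + (1 - alpha) * cos_qd

    print(net_score(0.9, 0.3), net_score(0.9, 0.3, alpha=0.2))
    ```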

  • Slide 25/67

    Top K by net score - fast methods

    - First idea: Order all postings by g(d)
    - Key: this is a common ordering for all postings
      - Thus, can concurrently traverse query terms' postings for
        - Postings intersection
        - Cosine score computation
    - Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

    7.1.4

  • Slide 26/67

    Why order postings by g(d)?

    - Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
    - In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
      - Short of computing scores for all docs in postings

    7.1.4

  • Slide 27/67

    Champion lists in g(d)-ordering

    - Can combine champion lists with g(d)-ordering
    - Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
    - Seek top-K results from only the docs in these champion lists

    7.1.4

  • Slide 28/67

    High and low lists

    - For each term, we maintain two postings lists called high and low
      - Think of high as the champion list
    - When traversing postings on a query, only traverse high lists first
      - If we get more than K docs, select the top K and stop
      - Else proceed to get docs from the low lists
    - Can be used even for simple cosine scores, without global quality g(d)
    - A means for segmenting the index into two tiers

    7.1.4

  • Slide 29/67

    Impact-ordered postings

    - We only want to compute scores for docs for which wf_{t,d} is high enough
    - We sort each postings list by wf_{t,d}
    - Now: not all postings in a common order!
      - How do we compute scores in order to pick off the top K?
    - Two ideas follow

    7.1.5

  • Slide 30/67

    1. Early termination

    - When traversing t's postings, stop early after either
      - a fixed number of r docs, or
      - wf_{t,d} drops below some threshold
    - Take the union of the resulting sets of docs
      - One from the postings of each query term
    - Compute only the scores for docs in this union

    7.1.5
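
    A minimal sketch of early termination over impact-ordered postings, assuming each `postings[t]` is already sorted by decreasing wf_{t,d} (names and parameters are illustrative):

    ```python
    def early_terminated_candidates(postings, query_terms, max_docs=None, min_wf=None):
        """Union of doc sets obtained by cutting each impact-ordered postings list early.

        postings: dict term -> list of (docID, wf_td) sorted by decreasing wf_td.
        Stop a list after max_docs entries, or once wf_td drops below min_wf.
        """
        union = set()
        for t in query_terms:
            for i, (doc_id, wf_td) in enumerate(postings.get(t, [])):
                if max_docs is not None and i >= max_docs:
                    break
                if min_wf is not None and wf_td < min_wf:
                    break
                union.add(doc_id)
        return union  # scores are then computed only for these docs
    ```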

  • Slide 31/67

    2. idf-ordered terms

    - When considering the postings of query terms
      - Look at them in order of decreasing idf
      - High-idf terms likely to contribute most to score
    - As we update the score contribution from each query term
      - Stop if doc scores are relatively unchanged
    - Can apply to cosine or some other net scores

    7.1.5

  • Slide 32/67

    Cluster pruning: preprocessing

    - Pick √N docs at random: call these leaders
    - For every other doc, pre-compute its nearest leader
      - Docs attached to a leader: its followers
      - Likely: each leader has ~ √N followers

    7.1.6

  • Slide 33/67

    Cluster pruning: query processing

    - Process a query as follows:
      - Given query Q, find its nearest leader L.
      - Seek the K nearest docs from among L's followers.

    7.1.6
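
    A hedged end-to-end sketch of cluster pruning over sparse term-weight vectors (dicts), with ~√N random leaders; all names and the vector representation are assumptions for illustration:

    ```python
    import heapq
    import math
    import random

    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def preprocess(docs):
        """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
        leaders = random.sample(list(docs), k=max(1, math.isqrt(len(docs))))
        followers = {L: [] for L in leaders}
        for d, vec in docs.items():
            L = max(leaders, key=lambda l: cos(vec, docs[l]))
            followers[L].append(d)
        return leaders, followers

    def query(q, docs, leaders, followers, K):
        """Find the nearest leader, then the K nearest docs among its followers."""
        L = max(leaders, key=lambda l: cos(q, docs[l]))
        return heapq.nlargest(K, followers[L], key=lambda d: cos(q, docs[d]))
    ```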

  • Slide 34/67

    Visualization

    (figure: a query point among leaders and their followers; legend: Query, Leader, Follower)

    7.1.6

  • Slide 35/67

    Why use random sampling

    - Fast
    - Leaders reflect data distribution

    7.1.6

  • Slide 36/67

    General variants

    - Have each follower attached to b1 = 3 (say) nearest leaders.
    - From the query, find b2 = 4 (say) nearest leaders and their followers.
    - Can recur on the leader/follower construction.

    7.1.6

  • Slide 37/67

    Exercises

    - To find the nearest leader in step 1, how many cosine computations do we do?
    - Why did we have √N in the first place?
    - What is the effect of the constants b1, b2 on the previous slide?
    - Devise an example where this is likely to fail, i.e., we miss one of the K nearest docs.
      - Likely under random sampling.

    7.1.6

  • Slide 38/67

    Parametric and zone indexes

    - Thus far, a doc has been a sequence of terms
    - In fact documents have multiple parts, some with special semantics:
      - Author
      - Title
      - Date of publication
      - Language
      - Format
      - etc.
    - These constitute the metadata about a document

    6.1

  • Slide 39/67

    Fields

    - We sometimes wish to search by these metadata
      - E.g., find docs authored by William Shakespeare in the year 1601, containing "alas poor Yorick"
    - Year = 1601 is an example of a field
    - Also, author last name = shakespeare, etc.
    - Field or parametric index: postings for each field value
      - Sometimes build range trees (e.g., for dates)
    - Field query typically treated as conjunction
      - (doc must be authored by Shakespeare)

    6.1

  • Slide 40/67

    Zone

    - A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
      - Title
      - Abstract
      - References
    - Build inverted indexes on zones as well to permit querying
      - E.g., find docs with "merchant" in the title zone and matching the query "gentle rain"

    6.1

  • Slide 41/67

    Example zone indexes

    Encode zones in dictionary vs. postings.

    6.1

  • Slide 42/67

    Tiered indexes

    - Break postings up into a hierarchy of lists
      - Most important
      - ...
      - Least important
    - Can be done by g(d) or another measure
    - Inverted index thus broken up into tiers of decreasing importance
    - At query time use top tier unless it fails to yield K docs
      - If so, drop to lower tiers

    7.2.1
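
    A minimal sketch of tiered query processing, assuming `tiers` is a list of inverted indexes ordered from most to least important and `score_docs` is whatever scoring routine is in use (both names are illustrative):

    ```python
    def tiered_search(tiers, query_terms, K, score_docs):
        """Use the top tier unless it fails to yield K docs; if so, drop to lower tiers.

        tiers:      list of dicts term -> postings list, most important tier first.
        score_docs: callable(candidate_docIDs, query_terms) -> list of (docID, score).
        """
        candidates, results = set(), []
        for tier in tiers:
            candidates |= {doc for t in query_terms for doc in tier.get(t, [])}
            results = sorted(score_docs(candidates, query_terms),
                             key=lambda item: item[1], reverse=True)[:K]
            if len(results) >= K:
                break
        return results
    ```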

  • Slide 43/67

    Example tiered index

    7.2.1

  • Slide 44/67

    Query term proximity

    - Free text queries: just a set of terms typed into the query box - common on the web
    - Users prefer docs in which query terms occur within close proximity of each other
    - Let w be the smallest window in a doc containing all query terms, e.g.,
      - For the query "strained mercy" the smallest window in the doc "The quality of mercy is not strained" is 4 (words)
    - Would like the scoring function to take this into account - how? (see the sketch below)

    7.2.2
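
    A minimal sketch of computing the smallest window w over a tokenized doc, using a sliding two-pointer scan (function and variable names are illustrative):

    ```python
    from collections import Counter

    def smallest_window(doc_tokens, query_terms):
        """Length (in words) of the smallest window of doc_tokens containing all query terms.

        Returns None if some query term is missing from the doc.
        """
        needed = set(query_terms)
        if not needed <= set(doc_tokens):
            return None
        best, have, covered, left = None, Counter(), 0, 0
        for right, tok in enumerate(doc_tokens):
            if tok in needed:
                have[tok] += 1
                if have[tok] == 1:
                    covered += 1
            while covered == len(needed):  # shrink from the left while still covered
                size = right - left + 1
                best = size if best is None else min(best, size)
                lt = doc_tokens[left]
                if lt in needed:
                    have[lt] -= 1
                    if have[lt] == 0:
                        covered -= 1
                left += 1
        return best

    doc = "the quality of mercy is not strained".split()
    print(smallest_window(doc, ["strained", "mercy"]))  # 4, as on the slide
    ```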

  • Slide 45/67

    Query parsers

    - A free text query from a user may in fact spawn one or more queries to the indexes, e.g. the query "rising interest rates"
      - Run the query as a phrase query
      - If fewer than K docs contain the phrase, fall back to less restrictive queries (phrase sub-queries, then the vector space query)

  • Slide 46/67

    Aggregate scores

    - We've seen that score functions can combine cosine, static quality, proximity, etc.
    - How do we know the best combination?
      - Some applications: expert-tuned
      - Increasingly common: machine-learned

    7.2.3

  • Slide 47/67

    Putting it all together

    7.2.4

  • Slide 48/67

    Evaluation of IR Systems

    Adapted from Lectures by

    Prabhakar Raghavan (Yahoo and Stanford) and

    Christopher Manning (Stanford)

  • Slide 49/67

    This lecture

    - Results summaries:
      - Making our good results usable to a user
    - How do we know if our results are any good?
      - Evaluating a search engine
      - Benchmarks
      - Precision and recall

  • Slide 50/67

    Result Summaries

    - Having ranked the documents matching a query, we wish to present a results list.
    - Most commonly, a list of the document titles plus a short summary, aka "10 blue links".

  • Slide 51/67

    Summaries

    - The title is typically automatically extracted from document metadata. What about the summaries?
    - This description is crucial.
      - Users can identify good/relevant hits based on the description.
    - Two basic kinds:
      - A static summary of a document is always the same, regardless of the query that hit the doc.
      - A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand.

  • Slide 52/67

    Static summaries

    - In typical systems, the static summary is a subset of the document.
    - Simplest heuristic: the first 50 (or so - this can be varied) words of the document
      - Summary cached at indexing time
    - More sophisticated: extract from each document a set of key sentences
      - Simple NLP heuristics to score each sentence
      - Summary is made up of top-scoring sentences.
    - Most sophisticated: NLP used to synthesize a summary
      - Seldom used in IR (cf. text summarization work)

  • Slide 53/67

    Dynamic summaries

    - Present one or more "windows" within the document that contain several of the query terms
    - KWIC snippets: Keyword-in-Context presentation
    - Generated in conjunction with scoring
      - If the query is found as a phrase, all or some occurrences of the phrase in the doc
      - If not, document windows that contain multiple query terms
    - The summary itself gives the entire content of the window - all terms, not only the query terms.

  • Slide 54/67

    Generating dynamic summaries

    - If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
    - If we cache the documents at index time, then we can find windows in them, cueing from hits found in the positional index.
      - E.g., the positional index says the query is a phrase at position 4378, so we go to this position in the cached document and stream out the content
    - Most often, cache only a fixed-size prefix of the doc.
      - Note: the cached copy can be outdated

  • Slide 55/67

    Dynamic summaries

    - Producing good dynamic summaries is a tricky optimization problem
      - The real estate for the summary is normally small and fixed
      - Want snippets to be long enough to be useful
      - Want linguistically well-formed snippets
      - Want snippets maximally informative about the doc
    - But users really like snippets, even if they complicate IR system design

  • Slide 56/67

    Alternative results presentations?

    - An active area of HCI research
    - An alternative: http://www.searchme.com/ copies the idea of Apple's Cover Flow for search results

  • Slide 57/67

    Evaluating search engines

  • Slide 58/67

    Measures for a search engine

    - How fast does it index
      - Number of documents/hour
      - (Average document size)
    - How fast does it search
      - Latency as a function of index size
    - Expressiveness of query language
      - Ability to express complex information needs
      - Speed on complex queries
    - Uncluttered UI
    - Is it free?

  • Slide 59/67

    Measures for a search engine

    - All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
    - The key measure: user happiness
      - What is this?
      - Speed of response / size of index are factors
      - But blindingly fast, useless answers won't make a user happy
    - Need a way of quantifying user happiness

  • Slide 60/67

    Data Retrieval vs. Information Retrieval

    - DR Performance Evaluation (after establishing correctness)
      - Response time
      - Index space
    - IR Performance Evaluation
      - How relevant is the answer set? (required to establish functional correctness, e.g., through benchmarks)

  • Slide 61/67

    Measuring user happiness

    - Issue: who is the user we are trying to make happy?
    - Depends on the setting/context
    - Web engine: user finds what they want and returns to the engine
      - Can measure rate of return users
    - eCommerce site: user finds what they want and makes a purchase
      - Is it the end-user, or the eCommerce site, whose happiness we measure?
      - Measure time to purchase, or fraction of searchers who become buyers?

  • Slide 62/67

    Measuring user happiness

    - Enterprise (company/govt/academic): care about user productivity
      - How much time do my users save when looking for information?
    - Many other criteria having to do with breadth of access, secure access, etc.

  • Slide 63/67

    Happiness: elusive to measure

    - Most common proxy: relevance of search results
    - But how do you measure relevance?
    - We will detail a methodology here, then examine its issues
    - Relevance measurement requires 3 elements:
      1. A benchmark document collection
      2. A benchmark suite of queries
      3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
         - Some work on more-than-binary, but not the standard

  • Slide 64/67

    Evaluating an IR system

    - Note: the information need is translated into a query
    - Relevance is assessed relative to the information need, not the query
      - E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine.
      - Query: wine red white heart attack effective
    - You evaluate whether the doc addresses the information need, not whether it has these words

  • Slide 65/67

    Difficulties with gauging Relevancy

    - Relevancy, from a human standpoint, is:
      - Subjective: Depends upon a specific user's judgment.
      - Situational: Relates to the user's current needs.
      - Cognitive: Depends on human perception and behavior.
      - Dynamic: Changes over time.

  • Slide 66/67

    Standard relevance benchmarks

    - TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
    - Reuters and other benchmark doc collections used
    - Retrieval tasks specified
      - sometimes as queries
    - Human experts mark, for each query and for each doc, Relevant or Nonrelevant
      - or at least for a subset of docs that some system returned for that query

  • Slide 67/67

    Unranked retrieval evaluation: Precision and Recall

    - Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
    - Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

                        Relevant   Nonrelevant
      Retrieved         tp         fp
      Not Retrieved     fn         tn

    - Precision P = tp/(tp + fp)
    - Recall R = tp/(tp + fn)
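
    A small sketch computing both measures from sets of docIDs (the set-based representation is an illustrative assumption):

    ```python
    def precision_recall(retrieved: set, relevant: set):
        """Precision = tp/(tp+fp), Recall = tp/(tp+fn), over sets of docIDs."""
        tp = len(retrieved & relevant)
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of the 4 retrieved docs are relevant; 6 docs are relevant overall.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))  # (0.75, 0.5)
    ```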