Transcript of cs707_020212

  • Slide 1/67

Vector Space Model:

    Ranking Revisited

    Adapted from Lectures by

    Prabhakar Raghavan (Yahoo and Stanford) and

    Christopher Manning (Stanford)

  • Slide 2/67

    Recap: tf-idf weighting

    - The tf-idf weight of a term is the product of its tf weight and its idf weight:

      $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$

    - Best known weighting scheme in information retrieval
    - Increases with the number of occurrences within a document
    - Increases with the rarity of the term in the collection
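
    A minimal Python sketch of this weight, assuming the raw term frequency tf, document frequency df, and collection size N are already available (the function name and arguments are illustrative, not from the slides):

    ```python
    import math

    def tfidf_weight(tf: int, df: int, N: int) -> float:
        """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    # Example: a term occurring 3 times in a doc, appearing in 100 of 1,000,000 docs.
    print(tfidf_weight(tf=3, df=100, N=1_000_000))  # (1 + log10 3) * 4 ≈ 5.91
    ```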

  • Slide 3/67

    Recap: Queries as vectors

    - Key idea 1: Do the same for queries: represent them as vectors in the space
    - Key idea 2: Rank documents according to their proximity to the query in this space
      - proximity = similarity of vectors

  • Slide 4/67

    Recap: cosine(query, document)

    $\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$

    (the numerator is the dot product; q/|q| and d/|d| are unit vectors)

    - cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
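
    As a companion to the formula, a small Python sketch of cosine similarity over sparse term-weight vectors represented as dicts (term -> weight); the representation is an assumption for illustration:

    ```python
    import math

    def cosine(q: dict, d: dict) -> float:
        """Cosine similarity of two sparse tf-idf vectors stored as dicts."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        if norm_q == 0 or norm_d == 0:
            return 0.0
        return dot / (norm_q * norm_d)

    print(cosine({"gentle": 1.0, "rain": 1.0},
                 {"gentle": 0.5, "rain": 0.2, "merchant": 0.8}))
    ```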

  • Slide 5/67

    This lecture

    - Speeding up vector space ranking
    - Putting together a complete search system
      - Will require learning about a number of miscellaneous topics and heuristics

  • Slide 6/67

    Computing cosine scores
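
    A minimal Python sketch of one way to compute cosine scores term-at-a-time with accumulators and then pick the top K with a heap; the `postings` and `length` structures and their shapes are assumptions for illustration, not the slide's exact pseudocode:

    ```python
    import heapq

    def cosine_score(query_weights, postings, length, K):
        """Term-at-a-time cosine scoring with accumulators, returning the top K docs.

        query_weights: dict term -> w_{t,q}
        postings:      dict term -> list of (docID, wf_{t,d}) pairs
        length:        dict docID -> document vector length (for normalization)
        """
        scores = {}  # one accumulator per doc touched
        for t, w_tq in query_weights.items():
            for doc_id, wf_td in postings.get(t, []):
                scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * wf_td
        for doc_id in scores:
            scores[doc_id] /= length[doc_id]
        return heapq.nlargest(K, scores.items(), key=lambda item: item[1])
    ```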

  • Slide 7/67

    Efficient cosine ranking

    - Find the K docs in the collection nearest to the query
      - i.e., the K largest query-doc cosines
    - Efficient ranking:
      - Computing a single cosine efficiently
      - Choosing the K largest cosine values efficiently
        - Can we do this without computing all N cosines?

  • Slide 8/67

    (cont'd)

    - What we're doing in effect: solving the K-nearest-neighbor problem for a query vector
    - In general, we do not know how to do this efficiently for high-dimensional spaces
    - But it is solvable for short queries, and standard indexes support this well

  • Slide 9/67

    Special case: unweighted queries

    - No weighting on query terms
      - Assume each query term occurs only once
    - Then for ranking, don't need to normalize the query vector
      - Slight simplification

    7.1

  • Slide 10/67

    Faster cosine: unweighted query

    (annotation on the algorithm figure: "Read redundant?")

    7.1

  • Slide 11/67

    Computing the K largest cosines: selection vs. sorting

    - Typically we want to retrieve the top K docs (in the cosine ranking for the query)
      - not to totally order all docs in the collection
    - Can we pick off docs with the K highest cosines?
    - Let J = number of docs with nonzero cosines
      - We seek the K best of these J

    7.1

  • Slide 12/67

    Use heap for selecting top K

    - Binary tree in which each node's value > the values of its children
    - Takes 2J operations to construct, then each of the K winners is read off in 2 log J steps
    - For J = 10M, K = 100, this is about 10% of the cost of sorting

    (figure: example max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1)

    7.1
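
    In Python, `heapq.nlargest` gives the same select-without-full-sort effect (internally it maintains a size-K heap rather than the exact construction described above); a small sketch with made-up cosine scores:

    ```python
    import heapq
    import random

    # J = 10,000 (doc, cosine) pairs with nonzero scores -- illustrative data only.
    random.seed(0)
    cosines = {doc_id: random.random() for doc_id in range(10_000)}

    K = 100
    # Selects the K largest values without sorting all J of them.
    top_k = heapq.nlargest(K, cosines.items(), key=lambda item: item[1])
    print(top_k[:3])
    ```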

  • Slide 13/67

    Bottlenecks

    - Primary computational bottleneck in scoring: cosine computation
    - Can we avoid all this computation?
    - Yes, but may sometimes get it wrong
      - a doc not in the top K may creep into the list of K output docs
      - Is this such a bad thing?

    7.1.1

  • Slide 14/67

    Cosine similarity is only a proxy

    - User has a task and a query formulation
    - Cosine matches docs to query
    - Thus cosine is anyway a proxy for user happiness
    - If we get a list of K docs close to the top K by cosine measure, it should be OK

    7.1.1

  • Slide 15/67

    Generic approach

    - Find a set A of contenders, with K < |A| << N
      - A does not necessarily contain the top K, but has many docs from among the top K
    - Return the top K docs in A

  • Slide 16/67

    Index elimination

    - Basic algorithm of Fig 7.1 considers only docs containing at least one query term
    - Take this further:
      - Consider only high-idf query terms
      - Consider only docs containing many query terms

    7.1.2

  • Slide 17/67

    High-idf query terms only

    - For a query such as "catcher in the rye"
      - Only accumulate scores from "catcher" and "rye"
    - Intuition: "in" and "the" contribute little to the scores and don't alter rank-ordering much
    - Benefit:
      - Postings of low-idf terms have many docs; these (many) docs get eliminated from A

    7.1.2

  • Slide 18/67

    Docs containing many query terms

    - Any doc with at least one query term is a candidate for the top K output list
    - For multi-term queries, only compute scores for docs containing several of the query terms
      - Say, at least 3 out of 4
      - Imposes a "soft conjunction" on queries seen on web search engines (early Google)
    - Easy to implement in postings traversal

    7.1.2

  • Slide 19/67

    3 of 4 query terms

    Antony:    3 4 8 16 32 64 128
    Brutus:    2 4 8 16 32 64 128
    Caesar:    1 2 3 5 8 13 21 34
    Calpurnia: 13 16 32

    Scores only computed for docs 8, 16 and 32 (a sketch of this filter follows).

    7.1.2
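
    A minimal sketch of the "at least 3 of 4 query terms" filter applied to the postings above; only the surviving docIDs would then get cosine scores:

    ```python
    from collections import Counter

    postings = {
        "antony":    [3, 4, 8, 16, 32, 64, 128],
        "brutus":    [2, 4, 8, 16, 32, 64, 128],
        "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
        "calpurnia": [13, 16, 32],
    }

    def candidates(postings, min_terms):
        """Docs appearing in at least min_terms of the query terms' postings lists."""
        counts = Counter(doc for plist in postings.values() for doc in plist)
        return sorted(doc for doc, c in counts.items() if c >= min_terms)

    print(candidates(postings, min_terms=3))  # [8, 16, 32] -- matches the slide
    ```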

  • Slide 20/67

    Champion lists

    - Precompute for each dictionary term t, the r docs of highest weight in t's postings
      - Call this the champion list for t
      - (aka fancy list or top docs for t)
    - Note that r has to be chosen at index time
    - At query time, only compute scores for docs in the champion list of some query term
      - Pick the K top-scoring docs from amongst these

    7.1.3
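
    A minimal sketch of building champion lists at index time and gathering query-time candidates, assuming `postings[t]` is a list of (docID, weight) pairs (an illustrative layout, not prescribed by the slides):

    ```python
    import heapq

    def build_champion_lists(postings, r):
        """For each term, keep only the r postings of highest weight (its champion list)."""
        return {
            t: heapq.nlargest(r, plist, key=lambda entry: entry[1])
            for t, plist in postings.items()
        }

    def candidate_docs(champion_lists, query_terms):
        """At query time, score only docs appearing in some query term's champion list."""
        return {doc for t in query_terms for doc, _ in champion_lists.get(t, [])}
    ```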

  • Slide 21/67

    Exercises

    - How do Champion Lists relate to Index Elimination? Can they be used together?
    - How can Champion Lists be implemented in an inverted index?
      - Note that the champion list has nothing to do with small docIDs

    7.1.3

  • Slide 22/67

    Static quality scores

    - We want top-ranking documents to be both relevant and authoritative
    - Relevance is being modeled by cosine scores
    - Authority is typically a query-independent property of a document
    - Examples of authority signals:
      - Wikipedia among websites
      - Articles in certain newspapers
      - A paper with many citations
      - Many diggs, Y! buzzes or del.icio.us marks
      - (PageRank)
      (Quantitative)

    7.1.4

  • Slide 23/67

    Modeling authority

    - Assign to each document d a query-independent quality score in [0,1]
      - Denote this by g(d)
    - Thus, a quantity like the number of citations is scaled into [0,1]
      - Exercise: suggest a formula for this (one possibility is sketched below).

    7.1.4
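
    One possible answer to the exercise; this is an assumption for illustration, not the slides' prescribed formula. A saturating transform keeps the score in [0,1):

    ```python
    def g(citations: int, c: float = 100.0) -> float:
        """Map a nonnegative citation count into [0,1); more citations -> closer to 1.

        c is a tuning constant: a doc with c citations gets score 0.5.
        """
        return citations / (citations + c)

    print(g(0), g(100), g(10_000))  # 0.0, 0.5, ~0.99
    ```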

  • Slide 24/67

    Net score

    - Consider a simple total score combining cosine relevance and authority
      - net-score(q,d) = g(d) + cosine(q,d)
    - Can use some other linear combination than an equal weighting
      - Indeed, any function of the two signals of user happiness - more later
    - Now we seek the top K docs by net score

    7.1.4
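
    The combination itself is one line; a hedged sketch with a tunable weight in place of the equal weighting (the parameter name is illustrative):

    ```python
    def net_score(g_d: float, cos_qd: float, alpha: float = 0.5) -> float:
        """Linear combination of static quality g(d) and cosine(q,d).

        alpha = 0.5 reproduces the equal weighting g(d) + cosine(q,d)
        up to a constant factor.
        """
        return alpha * g_d + (1 - alpha) * cos_qd

    print(net_score(0.9, 0.3), net_score(0.9, 0.3, alpha=0.2))
    ```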

  • Slide 25/67

    Top K by net score - fast methods

    - First idea: Order all postings by g(d)
    - Key: this is a common ordering for all postings
      - Thus, can concurrently traverse query terms' postings for
        - Postings intersection
        - Cosine score computation
    - Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

    7.1.4

  • Slide 26/67

    Why order postings by g(d)?

    - Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
    - In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
      - Short of computing scores for all docs in postings

    7.1.4

  • Slide 27/67

    Champion lists in g(d)-ordering

    - Can combine champion lists with g(d)-ordering
    - Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
    - Seek top-K results from only the docs in these champion lists

    7.1.4

  • Slide 28/67

    High and low lists

    - For each term, we maintain two postings lists called high and low
      - Think of high as the champion list
    - When traversing postings on a query, only traverse high lists first
      - If we get more than K docs, select the top K and stop
      - Else proceed to get docs from the low lists
    - Can be used even for simple cosine scores, without global quality g(d)
    - A means for segmenting the index into two tiers

    7.1.4

  • Slide 29/67

    Impact-ordered postings

    - We only want to compute scores for docs for which wf_{t,d} is high enough
    - We sort each postings list by wf_{t,d}
    - Now: not all postings in a common order!
      - How do we compute scores in order to pick off the top K?
    - Two ideas follow

    7.1.5

  • Slide 30/67

    1. Early termination

    - When traversing t's postings, stop early after either
      - a fixed number of r docs, or
      - wf_{t,d} drops below some threshold
    - Take the union of the resulting sets of docs
      - One from the postings of each query term
    - Compute only the scores for docs in this union

    7.1.5
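
    A minimal sketch of early termination over impact-ordered postings, assuming each `postings[t]` is already sorted by decreasing wf_{t,d} (names and parameters are illustrative):

    ```python
    def early_terminated_candidates(postings, query_terms, max_docs=None, min_wf=None):
        """Union of doc sets obtained by cutting each impact-ordered postings list early.

        postings: dict term -> list of (docID, wf_td) sorted by decreasing wf_td.
        Stop a list after max_docs entries, or once wf_td drops below min_wf.
        """
        union = set()
        for t in query_terms:
            for i, (doc_id, wf_td) in enumerate(postings.get(t, [])):
                if max_docs is not None and i >= max_docs:
                    break
                if min_wf is not None and wf_td < min_wf:
                    break
                union.add(doc_id)
        return union  # scores are then computed only for these docs
    ```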

  • Slide 31/67

    2. idf-ordered terms

    - When considering the postings of query terms
      - Look at them in order of decreasing idf
      - High-idf terms likely to contribute most to score
    - As we update the score contribution from each query term
      - Stop if doc scores are relatively unchanged
    - Can apply to cosine or some other net scores

    7.1.5

  • Slide 32/67

    Cluster pruning: preprocessing

    - Pick √N docs at random: call these leaders
    - For every other doc, pre-compute its nearest leader
      - Docs attached to a leader: its followers
      - Likely: each leader has ~ √N followers

    7.1.6

  • Slide 33/67

    Cluster pruning: query processing

    - Process a query as follows:
      - Given query Q, find its nearest leader L.
      - Seek the K nearest docs from among L's followers.

    7.1.6
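
    A hedged end-to-end sketch of cluster pruning over sparse term-weight vectors (dicts), with ~√N random leaders; all names and the vector representation are assumptions for illustration:

    ```python
    import heapq
    import math
    import random

    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def preprocess(docs):
        """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
        leaders = random.sample(list(docs), k=max(1, math.isqrt(len(docs))))
        followers = {L: [] for L in leaders}
        for d, vec in docs.items():
            L = max(leaders, key=lambda l: cos(vec, docs[l]))
            followers[L].append(d)
        return leaders, followers

    def query(q, docs, leaders, followers, K):
        """Find the nearest leader, then the K nearest docs among its followers."""
        L = max(leaders, key=lambda l: cos(q, docs[l]))
        return heapq.nlargest(K, followers[L], key=lambda d: cos(q, docs[d]))
    ```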

  • Slide 34/67

    Visualization

    (figure: a query point among leaders and their followers; legend: Query, Leader, Follower)

    7.1.6

  • Slide 35/67

    Why use random sampling

    - Fast
    - Leaders reflect data distribution

    7.1.6

  • Slide 36/67

    General variants

    - Have each follower attached to b1 = 3 (say) nearest leaders.
    - From the query, find b2 = 4 (say) nearest leaders and their followers.
    - Can recur on the leader/follower construction.

    7.1.6

  • Slide 37/67

    Exercises

    - To find the nearest leader in step 1, how many cosine computations do we do?
    - Why did we have √N in the first place?
    - What is the effect of the constants b1, b2 on the previous slide?
    - Devise an example where this is likely to fail, i.e., we miss one of the K nearest docs.
      - Likely under random sampling.

    7.1.6

  • Slide 38/67

    Parametric and zone indexes

    - Thus far, a doc has been a sequence of terms
    - In fact documents have multiple parts, some with special semantics:
      - Author
      - Title
      - Date of publication
      - Language
      - Format
      - etc.
    - These constitute the metadata about a document

    6.1

  • Slide 39/67

    Fields

    - We sometimes wish to search by these metadata
      - E.g., find docs authored by William Shakespeare in the year 1601, containing "alas poor Yorick"
    - Year = 1601 is an example of a field
    - Also, author last name = shakespeare, etc.
    - Field or parametric index: postings for each field value
      - Sometimes build range trees (e.g., for dates)
    - Field query typically treated as conjunction
      - (doc must be authored by Shakespeare)

    6.1

  • Slide 40/67

    Zone

    - A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
      - Title
      - Abstract
      - References
    - Build inverted indexes on zones as well to permit querying
      - E.g., find docs with "merchant" in the title zone and matching the query "gentle rain"

    6.1

  • Slide 41/67

    Example zone indexes

    Encode zones in dictionary vs. postings.

    6.1

  • Slide 42/67

    Tiered indexes

    - Break postings up into a hierarchy of lists
      - Most important
      - ...
      - Least important
    - Can be done by g(d) or another measure
    - Inverted index thus broken up into tiers of decreasing importance
    - At query time use top tier unless it fails to yield K docs
      - If so, drop to lower tiers

    7.2.1
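
    A minimal sketch of tiered query processing, assuming `tiers` is a list of inverted indexes ordered from most to least important and `score_docs` is whatever scoring routine is in use (both names are illustrative):

    ```python
    def tiered_search(tiers, query_terms, K, score_docs):
        """Use the top tier unless it fails to yield K docs; if so, drop to lower tiers.

        tiers:      list of dicts term -> postings list, most important tier first.
        score_docs: callable(candidate_docIDs, query_terms) -> list of (docID, score).
        """
        candidates, results = set(), []
        for tier in tiers:
            candidates |= {doc for t in query_terms for doc in tier.get(t, [])}
            results = sorted(score_docs(candidates, query_terms),
                             key=lambda item: item[1], reverse=True)[:K]
            if len(results) >= K:
                break
        return results
    ```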

  • Slide 43/67

    Example tiered index

    7.2.1

  • Slide 44/67

    Query term proximity

    - Free text queries: just a set of terms typed into the query box - common on the web
    - Users prefer docs in which query terms occur within close proximity of each other
    - Let w be the smallest window in a doc containing all query terms, e.g.,
      - For the query "strained mercy" the smallest window in the doc "The quality of mercy is not strained" is 4 (words)
    - Would like the scoring function to take this into account - how? (see the sketch below)

    7.2.2
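
    A minimal sketch of computing the smallest window w over a tokenized doc, using a sliding two-pointer scan (function and variable names are illustrative):

    ```python
    from collections import Counter

    def smallest_window(doc_tokens, query_terms):
        """Length (in words) of the smallest window of doc_tokens containing all query terms.

        Returns None if some query term is missing from the doc.
        """
        needed = set(query_terms)
        if not needed <= set(doc_tokens):
            return None
        best, have, covered, left = None, Counter(), 0, 0
        for right, tok in enumerate(doc_tokens):
            if tok in needed:
                have[tok] += 1
                if have[tok] == 1:
                    covered += 1
            while covered == len(needed):  # shrink from the left while still covered
                size = right - left + 1
                best = size if best is None else min(best, size)
                lt = doc_tokens[left]
                if lt in needed:
                    have[lt] -= 1
                    if have[lt] == 0:
                        covered -= 1
                left += 1
        return best

    doc = "the quality of mercy is not strained".split()
    print(smallest_window(doc, ["strained", "mercy"]))  # 4, as on the slide
    ```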

  • Slide 45/67

    Query parsers

    - A free text query from a user may in fact spawn one or more queries to the indexes, e.g. the query "rising interest rates"
      - Run the query as a phrase query
      - If fewer than K docs contain the phrase, fall back to less restrictive queries (phrase sub-queries, then the vector space query)

  • Slide 46/67

    Aggregate scores

    - We've seen that score functions can combine cosine, static quality, proximity, etc.
    - How do we know the best combination?
      - Some applications: expert-tuned
      - Increasingly common: machine-learned

    7.2.3

  • Slide 47/67

    Putting it all together

    7.2.4

  • Slide 48/67

    Evaluation of IR Systems

    Adapted from Lectures by

    Prabhakar Raghavan (Yahoo and Stanford) and

    Christopher Manning (Stanford)

  • Slide 49/67

    This lecture

    - Results summaries:
      - Making our good results usable to a user
    - How do we know if our results are any good?
      - Evaluating a search engine
      - Benchmarks
      - Precision and recall

  • Slide 50/67

    Result Summaries

    - Having ranked the documents matching a query, we wish to present a results list.
    - Most commonly, a list of the document titles plus a short summary, aka "10 blue links".

  • Slide 51/67

    Summaries

    - The title is typically automatically extracted from document metadata. What about the summaries?
    - This description is crucial.
      - Users can identify good/relevant hits based on the description.
    - Two basic kinds:
      - A static summary of a document is always the same, regardless of the query that hit the doc.
      - A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand.

  • Slide 52/67

    Static summaries

    - In typical systems, the static summary is a subset of the document.
    - Simplest heuristic: the first 50 (or so - this can be varied) words of the document
      - Summary cached at indexing time
    - More sophisticated: extract from each document a set of key sentences
      - Simple NLP heuristics to score each sentence
      - Summary is made up of top-scoring sentences.
    - Most sophisticated: NLP used to synthesize a summary
      - Seldom used in IR (cf. text summarization work)

  • Slide 53/67

    Dynamic summaries

    - Present one or more "windows" within the document that contain several of the query terms
    - KWIC snippets: Keyword-in-Context presentation
    - Generated in conjunction with scoring
      - If the query is found as a phrase, all or some occurrences of the phrase in the doc
      - If not, document windows that contain multiple query terms
    - The summary itself gives the entire content of the window - all terms, not only the query terms.

  • Slide 54/67

    Generating dynamic summaries

    - If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
    - If we cache the documents at index time, then we can find windows in them, cueing from hits found in the positional index.
      - E.g., the positional index says the query is a phrase at position 4378, so we go to this position in the cached document and stream out the content
    - Most often, cache only a fixed-size prefix of the doc.
      - Note: the cached copy can be outdated

  • Slide 55/67

    Dynamic summaries

    - Producing good dynamic summaries is a tricky optimization problem
      - The real estate for the summary is normally small and fixed
      - Want snippets to be long enough to be useful
      - Want linguistically well-formed snippets
      - Want snippets maximally informative about the doc
    - But users really like snippets, even if they complicate IR system design

  • Slide 56/67

    Alternative results presentations?

    - An active area of HCI research
    - An alternative: http://www.searchme.com/ copies the idea of Apple's Cover Flow for search results

  • Slide 57/67

    Evaluating search engines

  • Slide 58/67

    Measures for a search engine

    - How fast does it index
      - Number of documents/hour
      - (Average document size)
    - How fast does it search
      - Latency as a function of index size
    - Expressiveness of query language
      - Ability to express complex information needs
      - Speed on complex queries
    - Uncluttered UI
    - Is it free?

  • Slide 59/67

    Measures for a search engine

    - All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
    - The key measure: user happiness
      - What is this?
      - Speed of response / size of index are factors
      - But blindingly fast, useless answers won't make a user happy
    - Need a way of quantifying user happiness

  • Slide 60/67

    Data Retrieval vs. Information Retrieval

    - DR Performance Evaluation (after establishing correctness)
      - Response time
      - Index space
    - IR Performance Evaluation
      - How relevant is the answer set? (required to establish functional correctness, e.g., through benchmarks)

  • Slide 61/67

    Measuring user happiness

    - Issue: who is the user we are trying to make happy?
    - Depends on the setting/context
    - Web engine: user finds what they want and returns to the engine
      - Can measure rate of return users
    - eCommerce site: user finds what they want and makes a purchase
      - Is it the end-user, or the eCommerce site, whose happiness we measure?
      - Measure time to purchase, or fraction of searchers who become buyers?

  • Slide 62/67

    Measuring user happiness

    - Enterprise (company/govt/academic): care about user productivity
      - How much time do my users save when looking for information?
    - Many other criteria having to do with breadth of access, secure access, etc.

  • Slide 63/67

    Happiness: elusive to measure

    - Most common proxy: relevance of search results
    - But how do you measure relevance?
    - We will detail a methodology here, then examine its issues
    - Relevance measurement requires 3 elements:
      1. A benchmark document collection
      2. A benchmark suite of queries
      3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
         - Some work on more-than-binary, but not the standard

  • Slide 64/67

    Evaluating an IR system

    - Note: the information need is translated into a query
    - Relevance is assessed relative to the information need, not the query
      - E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine.
      - Query: wine red white heart attack effective
    - You evaluate whether the doc addresses the information need, not whether it has these words

  • Slide 65/67

    Difficulties with gauging Relevancy

    - Relevancy, from a human standpoint, is:
      - Subjective: Depends upon a specific user's judgment.
      - Situational: Relates to the user's current needs.
      - Cognitive: Depends on human perception and behavior.
      - Dynamic: Changes over time.

  • Slide 66/67

    Standard relevance benchmarks

    - TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
    - Reuters and other benchmark doc collections used
    - Retrieval tasks specified
      - sometimes as queries
    - Human experts mark, for each query and for each doc, Relevant or Nonrelevant
      - or at least for a subset of docs that some system returned for that query

  • Slide 67/67

    Unranked retrieval evaluation: Precision and Recall

    - Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
    - Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

                        Relevant   Nonrelevant
      Retrieved         tp         fp
      Not Retrieved     fn         tn

    - Precision P = tp/(tp + fp)
    - Recall R = tp/(tp + fn)
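
    A small sketch computing both measures from sets of docIDs (the set-based representation is an illustrative assumption):

    ```python
    def precision_recall(retrieved: set, relevant: set):
        """Precision = tp/(tp+fp), Recall = tp/(tp+fn), over sets of docIDs."""
        tp = len(retrieved & relevant)
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of the 4 retrieved docs are relevant; 6 docs are relevant overall.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))  # (0.75, 0.5)
    ```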