Thanks to C. Manning, P. Raghavan, H. Schütze.
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 1: Boolean retrieval
Information Retrieval: IR
- Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
- Started in the 1950s. SIGIR (1980), TREC (1992)
- The field of IR also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents
  - clustering
  - classification
- Scale: from web search to personal information retrieval
How good are the retrieved docs?
- Precision: fraction of retrieved docs that are relevant to the user’s information need
- Recall: fraction of relevant docs in the collection that are retrieved
- More precise definitions and measurements to follow in later lectures
Boolean retrieval
- Queries are Boolean expressions
  - e.g., Brutus AND Caesar
- Example collection: Shakespeare’s Collected Works
  - Which plays of Shakespeare contain the words Brutus AND Caesar?
- The search engine returns all documents satisfying the Boolean expression.
  - Does Google use the Boolean model?
- http://www.rhymezone.com/shakespeare/
Term-document incidence matrix (1 if the play contains the word, 0 otherwise):

  Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
  Antony              1                    1              0           0        0         1
  Brutus              1                    1              0           1        0         0
  Caesar              1                    1              0           1        1         1
  Calpurnia           0                    1              0           0        0         0
  Cleopatra           1                    0              0           0        0         0
  mercy               1                    0              1           1        1         1
  worser              1                    0              1           1        1         0

Query: Brutus AND Caesar but NOT Calpurnia
Inverted index
- For each term T, we must store a list of all documents that contain T.

  Dictionary      Postings lists (sorted by docID; each entry is a posting)
  Brutus     →    2  4  8  16  32  64  128
  Caesar     →    1  2  3  5  8  13  21  34
  Calpurnia  →    13  16
Boolean query processing: AND
- Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the Dictionary; retrieve its postings.
  - Locate Caesar in the Dictionary; retrieve its postings.
  - “Merge” (intersect) the two postings lists:

    Brutus → 2  4  8  16  32  64  128
    Caesar → 1  2  3  5  8  13  21  34
    Result → 2  8
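The merge walks both sorted lists in lockstep, advancing the pointer at the smaller docID. A minimal Python sketch of this intersection (function and variable names are illustrative, not from the notes):

```python
def intersect(p1, p2):
    """Merge two sorted postings lists, returning docIDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1      # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```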
Example: WestLaw http://www.westlaw.com/
- Commercially successful Boolean retrieval
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still use Boolean queries
- Example query:
  - What is the statute of limitations in cases involving the federal tort claims act?
  - LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  - /3 = within 3 words, /S = in same sentence
Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then AND them together.

  Brutus     →  2  4  8  16  32  64  128
  Caesar     →  1  2  3  5  8  16  21  34
  Calpurnia  →  13  16

- Query: Brutus AND Calpurnia AND Caesar
- Process the terms in order of increasing postings-list length, so that intermediate results stay small (see the sketch below).
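A sketch of that ordering heuristic (a simplification; a real system would order by the document frequencies stored in the dictionary rather than by materialized list lengths):

```python
def intersect_query(postings_lists):
    """AND together several postings lists, shortest first so that
    intermediate results stay as small as possible."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = sorted(set(result) & set(plist))  # or the merge from the previous sketch
        if not result:
            break                                  # early exit: the AND is already empty
    return result

# Brutus AND Calpurnia AND Caesar: start with Calpurnia, the shortest list
brutus    = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [13, 16]
caesar    = [1, 2, 3, 5, 8, 16, 21, 34]
print(intersect_query([brutus, calpurnia, caesar]))   # [16]
```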
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 2: The term vocabulary and postings lists
Parsing a document
- Before we start worrying about terms … need to know the format and language of each document
- What format is it in?
  - pdf/word/excel/html?
- What language is it in?
- What character set is in use?
- Each of these is a classification problem,
  - but often done heuristically
What is a document unit?
- A file?
  - Traditional Unix stores a sequence of emails in one file, but you might want to regard each email as a separate document
- An email with 5 attachments?
- Indexing granularity, e.g. for a collection of books
  - Each book as a document?
  - Each chapter? Each paragraph? Each sentence?
- Precision/recall tradeoff
  - Small unit: good precision, poor recall
  - Big unit: good recall, poor precision
Tokenization
- Input: “Friends, Romans, Countrymen”
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- Each such token is now a candidate for an index entry, after further processing
  - Described below
- But what are valid tokens to emit?
Common terms: stop words
- Stop words = extremely common words which appear to be of little value in helping select documents matching a user need
  - a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
  - There are a lot of them: ~30% of postings for the top 30 words
- Stop word elimination used to be standard in older IR systems
  - Size of stop list: 200–300 terms; smaller lists 7–12 terms
Current trend
- The trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques mean you pay little at query time for including stop words.
  - You need them for:
    - Phrase queries: “King of Denmark”
    - Various song titles, etc.: “Let it be”, “To be or not to be”
    - “Relational” queries: “flights to London”
- Nowadays search engines generally do not eliminate stop words
Normalization
- Need to “normalize” terms in indexed text as well as query terms into the same form
  - We want to match U.S.A. and USA
- We most commonly implicitly define equivalence classes of terms
  - e.g., by deleting periods in a term
- Alternative is to do asymmetric expansion:
  - Enter: window    Search: window, windows
  - Enter: windows   Search: Windows, windows, window
  - Enter: Windows   Search: Windows (no expansion)
- Two approaches for the (more powerful) alternative
  - Index unnormalized tokens and expand query terms
  - Expand during index construction
  - Both are less efficient than equivalence classing
Case folding
- Reduce all letters to lower case
  - exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
- Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
Lemmatization
- Reduce inflectional/variant forms to base form
  - am, are, is → be
  - car, cars, car's, cars' → car
  - the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing “proper” reduction to dictionary headword form
Stemming
- Reduce terms to their “roots” before indexing
- “Stemming” suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat.
- Example text: for example compressed and compression are both accepted as equivalent to compress.
- After stemming: for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm (1980)
- Most common algorithm for stemming English
  - Results suggest it’s at least as good as other stemming options
- 5 phases of reductions
  - phases applied sequentially
  - Within each phase, there are various conventions for selecting rules
    - E.g., sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
- http://www.tartarus.org/~martin/PorterStemmer/
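A quick way to experiment with Porter stemming is the implementation shipped with NLTK (this assumes the nltk package is installed; it is not part of the notes):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Print the stem produced for each surface form; related forms should share a root.
for word in ["automate", "automates", "automatic", "automation",
             "compressed", "compression"]:
    print(word, "->", stemmer.stem(word))
```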
Phrase queries
- Want to be able to answer queries such as “stanford university” – as a phrase
- “The inventor Stanford Ovshinsky never went to university” is not a match.
- The concept of phrase queries has proven easily understood by users
  - 10% of web queries are phrase queries
- For this, it no longer suffices to store only <term : docs> entries
- Any ideas?
Solution 2: Positional indexes
- In the postings, store, for each term, entries of the form:
  <term, number of docs containing term;
   doc1: position1, position2 … ;
   doc2: position1, position2 … ;
   etc.>

Proximity queries: same idea
- employment /3 place
- Find all documents that contain employment and place within 3 words of each other
  - “Employment agencies that place healthcare workers are seeing growth”
    - hit
  - “Employment agencies that help place healthcare workers are seeing growth”
    - not a hit
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
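A minimal sketch of the within-k check on two sorted position lists for a single document (the function name and sample positions are illustrative, not from the notes):

```python
def within_k(positions1, positions2, k):
    """True if some position in the first sorted list is within k words
    of some position in the second sorted list."""
    i = j = 0
    while i < len(positions1) and j < len(positions2):
        if abs(positions1[i] - positions2[j]) <= k:
            return True
        if positions1[i] < positions2[j]:
            i += 1      # advance whichever position is smaller
        else:
            j += 1
    return False

# "employment /3 place": position lists of the two terms in one document
print(within_k([3, 47], [6, 90], 3))   # True: positions 3 and 6 are within 3 words
```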
Positional index size
- You can compress position values/offsets: covered in chapter 5
- Nevertheless, a positional index expands postings storage substantially
  - Need an entry for each occurrence, not just once per document
  - Compare to biword: “index blowup due to a bigger dictionary”
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries
Rules of thumb
- A positional index is 2–4 times as large as a non-positional index
- Positional index size is 35–50% of the volume of the original text
- Caveat: the above holds for English-like languages.
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 3: Dictionaries and tolerant retrieval
Chapter 4: Index construction
Chapter 5: Index compression
Dictionary
- The dictionary is the data structure for storing the term vocabulary
- For each term, we need to store:
  - document frequency
  - pointer to its postings list

Dictionary data structures
- Two main choices:
  - Hash table
  - Tree
- Some IR systems use hashes, some trees
- Criteria in choosing hash or tree
  - fixed number of terms, or keeps growing?
  - Relative frequencies with which various keys are accessed
  - How many terms
Distributed indexing
- For web-scale indexing (don’t try this at home!): must use a distributed computing cluster
- Individual machines are fault-prone
  - Can unpredictably slow down or fail
- How do we exploit such a pool of machines?
Google data centers
- Google data centers mainly use commodity machines
- Data centers are distributed around the world.
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
- Estimate: Google installs 100,000 servers each quarter.
  - Based on expenditures of $200–250 million per year
- This would be 10% of the computing capacity of the world!?!
Data flow
(Figure: the master assigns input splits to parsers in the map phase; each parser writes key-value pairs into term-partitioned segment files (a-f, g-p, q-z); in the reduce phase, one inverter per term partition collects its segment files into postings.)
MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple architecture for distributed computing …
- … without having to write code for the distribution part.

MapReduce
- MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs
  - For indexing, (termID, docID)
- Map: mapping splits of the input data to key-value pairs
- Reduce: all values for a given key are stored close together, so that they can be read and processed quickly
  - This is achieved by partitioning the keys into j term partitions and having the parsers write key-value pairs for each term partition into a separate segment file
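A toy, single-machine illustration of the map and reduce steps for indexing (the real system distributes the pairs across term-partitioned segment files; the names and two-document collection are illustrative):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit (term, docID) pairs for one split of the input."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Reduce: collect all docIDs for each term into a sorted postings list."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(docs) for term, docs in index.items()}

docs = {1: "Caesar came", 2: "Brutus and Caesar"}
pairs = [p for d, text in docs.items() for p in map_phase(d, text)]
print(reduce_phase(pairs))   # e.g. {'caesar': [1, 2], 'came': [1], 'brutus': [2], 'and': [2]}
```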
MapReduce
- They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
- Index construction was just one phase.
- Another phase: transforming a term-partitioned index into a document-partitioned index.
  - Term-partitioned: one machine handles a subrange of terms
  - Document-partitioned: one machine handles a subrange of documents
- (As we discuss in the web part of the course) most search engines use a document-partitioned index (better load balancing, etc.)
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 6: Scoring, term weighting and the vector space model
Ranked retrieval
- Thus far, our queries have all been Boolean.
  - Documents either match or don’t
- Good for expert users with precise understanding of their needs and the collection.
- Also good for applications, which can easily consume 1000s of results
- Not good for the majority of users.
  - Most users are incapable of writing Boolean queries (or they are, but they think it’s too much work).
  - Most users don’t want to wade through 1000s of results.
    - This is particularly true of web search.
Problem with Boolean search: feast or famine
- Boolean queries often result in either too few (=0) or too many (1000s) results.
- Query 1: “standard user dlink 650”
  - 200,000 hits
- Query 2: “standard user dlink 650 no card found”
  - 0 hits
- It takes skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many
Take 1: Jaccard coefficient
- A commonly used measure of overlap of two sets A and B
- jaccard(A,B) = |A ∩ B| / |A ∪ B|
- jaccard(A,A) = 1
- jaccard(A,B) = 0 if A ∩ B = ∅
- Always assigns a number between 0 and 1.
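A direct Python translation (the example word sets below are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard coefficient of two term sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard("ides of march".split(), "march of the ides".split()))   # 0.75
print(jaccard("ides of march".split(), "ides of march".split()))       # 1.0
```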
Issues with Jaccard for scoring
- It doesn’t consider term frequency (how many times a term occurs in a document)
  - tf weight
- Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information
  - idf weight
- We need a more sophisticated way of normalizing for length
  - cosine
Bag of words model
- The vector representation doesn’t consider the ordering of words in a document
- “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
- This is called the bag of words model.
Term frequency
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use term frequency when computing query-document match scores. But how?
- Raw term frequency may not be what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
  - Relevance does not increase proportionally with term frequency
term frequency (tf) weight
- Many variants of the tf weight exist; log-frequency weighting is a common one, dampening the effect of raw tf (the raw count):

  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise

- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- The score is 0 if none of the query terms is present in the document.
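A quick check of those example values (a minimal sketch):

```python
import math

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in (0, 1, 2, 10, 1000):
    print(tf, "->", round(log_tf(tf), 1))   # 0, 1, 1.3, 2, 4
```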
Document frequency
- Rare terms are more informative than frequent terms
  - Recall stop words
- Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
- → We want a high weight for rare terms like arachnocentric.
Document frequency, continued
- Consider a query term that is frequent in the collection (e.g., high, increase, line)
- A document containing such a term is more likely to be relevant than a document that doesn’t, but it’s not a sure indicator of relevance.
- For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
- We will use document frequency (df) to capture this in the score.
- df (≤ N) is the number of documents that contain the term
Inverse document frequency (idf) weight
- df_t is the document frequency of t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
  - Inverse document frequency is a direct measure of the informativeness of t
- We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

  - use log to dampen the effect of N/df_t
  - Most common variant of the idf weight
idf example, suppose N = 1 million

  term       df_t        idf_t
  calpurnia  1           6
  animal     100         4
  sunday     1,000       3
  fly        10,000      2
  under      100,000     1
  the        1,000,000   0

There is one idf value for each term t in a collection: idf_t = log10(N / df_t)
Collection vs. Document frequency
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example: which word is a better search term (and should get a higher weight)?

  Word       Collection frequency   Document frequency
  insurance  10440                  3997
  try        10422                  8760

- The example suggests that df is better for weighting than cf
tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:

  tf-idf_{t,d} = tf weight(t,d) × idf weight(t) = (1 + log10(tf_{t,d})) × log10(N / df_t)

- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
- Best known instantiation of tf-idf weighting
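Putting the two factors together as a small function (a sketch; the example numbers are chosen to make the arithmetic round):

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# With N = 1,000,000 docs: a term occurring 10 times in the doc and in 1,000 docs overall
print(tf_idf(10, 1000, 1_000_000))   # (1 + 1) * 3 = 6.0
```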
Recall: Binary term-document incidence matrix

  Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
  Antony              1                    1              0           0        0         0
  Brutus              1                    1              0           1        0         0
  Caesar              1                    1              0           1        1         1
  Calpurnia           0                    1              0           0        0         0
  Cleopatra           1                    0              0           0        0         0
  mercy               1                    0              1           1        1         1
  worser              1                    0              1           1        1         0

Each document is represented by a binary vector ∈ {0,1}^|V|
Term-document count matrices
- Consider the number of occurrences of a term in a document:
- Each document is a count vector in ℕ^|V|: a column below

  Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
  Antony            157                   73              0           0        0         0
  Brutus              4                  157              0           1        0         0
  Caesar            232                  227              0           2        1         1
  Calpurnia           0                   10              0           0        0         0
  Cleopatra          57                    0              0           0        0         0
  mercy               2                    0              3           5        5         1
  worser              2                    0              1           1        1         0
Binary → count → weight matrix

  Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
  Antony           5.25                  2.44            0           0        0         0
  Brutus           0.16                  6.10            0          0.04      0         0
  Caesar           8.59                  8.40            0          0.07     0.04      0.04
  Calpurnia         0                    1.54            0           0        0         0
  Cleopatra        2.85                   0              0           0        0         0
  mercy            1.51                   0             2.27        3.78     3.78      0.76
  worser           1.37                   0             0.69        0.69     0.69       0

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|
Documents as vectors
- So we have a |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
- Very high-dimensional
  - hundreds of millions of dimensions when you apply this to a web search engine
- This is a very sparse vector
  - most entries are zero
Queries as vectors
- Key idea 1: Do the same for queries: represent them as vectors in the space
- Key idea 2: Rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: We do this because we want to get away from the either-in-or-out Boolean model.
  - Instead: rank more relevant documents higher than less relevant documents
Why Euclidean distance is bad
- The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance
- Thought experiment: take a document d and append it to itself. Call this document d′.
- “Semantically” d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.
Length normalization
- A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:

  ||x||_2 = sqrt( Σ_i x_i² )

- Dividing a vector by its L2 norm makes it a unit (length) vector
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- The cosine of the angle between two normalized vectors is the dot product of the two
cosine(query, document)

  cos(q, d) = (q · d) / (|q| |d|) = Σ_{t=1..|V|} q_t d_t / ( sqrt(Σ_{t=1..|V|} q_t²) · sqrt(Σ_{t=1..|V|} d_t²) )

  (the dot product of the two unit vectors q/|q| and d/|d|)

- q_t is the tf-idf weight of term t in the query; d_t is the tf-idf weight of term t in the document
- cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
- The cosine similarity can be seen as a method of normalizing document length during comparison
Cosine similarity example

  term   d     q     normalized d   normalized q
  t1     0.5   1.5   0.51           0.83
  t2     0.8   1     0.81           0.555
  t3     0.3   0     0.30           0

  sim(d, q) = (0.5×1.5 + 0.8×1 + 0.3×0) / ( sqrt(0.5² + 0.8² + 0.3²) × sqrt(1.5² + 1² + 0²) )
            = 1.55 / (0.99 × 1.8) = 0.87

  Equivalently, with the normalized vectors: sim(d, q) = 0.51×0.83 + 0.81×0.555 + 0.30×0 ≈ 0.87
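The same computation as a small function, checked against the worked example above (a minimal sketch):

```python
import math

def cosine(q, d):
    """Cosine similarity of two weight vectors of equal length."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

d = [0.5, 0.8, 0.3]
q = [1.5, 1.0, 0.0]
print(round(cosine(q, d), 2))   # 0.87, matching the worked example
```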
More variants of tf-idf weighting
(Figure: SMART notation table, not reproduced; columns headed ‘n’ are acronyms for weight schemes.)
Summary – vector space ranking
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top k (e.g., k = 10) to the user
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 7: Computing scores in a complete search system
Content
- Speeding up vector space ranking
- Putting together a complete search system
Cluster pruning: query processing
- Process a query as follows:
  - Given query Q, find its nearest leader L.
  - Seek the K nearest docs from among L’s followers (see the sketch below).
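A minimal sketch of cluster pruning, assuming the preprocessing step from the full chapter (pick √N random documents as leaders and assign every document to its nearest leader); sim can be the cosine function sketched earlier, and all names below are illustrative:

```python
import math
import random

def cluster_prune_search(query_vec, doc_vecs, k, sim):
    """Cluster pruning: route the query to its nearest leader, then rank
    only that leader's followers instead of the whole collection."""
    n = len(doc_vecs)
    leaders = random.sample(range(n), max(1, int(math.sqrt(n))))
    # Offline step (sketched inline here): assign each doc to its nearest leader.
    followers = {l: [] for l in leaders}
    for d in range(n):
        best = max(leaders, key=lambda l: sim(doc_vecs[d], doc_vecs[l]))
        followers[best].append(d)
    # Query time: nearest leader, then top-k among its followers.
    best_leader = max(leaders, key=lambda l: sim(query_vec, doc_vecs[l]))
    candidates = followers[best_leader]
    return sorted(candidates, key=lambda d: sim(query_vec, doc_vecs[d]), reverse=True)[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
print(cluster_prune_search([1.0, 0.2], docs, k=2, sim=dot))
```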
Visualization
(Figure: the query point, the leaders, and each leader’s followers.)
Putting it all together
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 8: Evaluation and result summaries
Summaries
- The title is typically automatically extracted from document metadata. What about the summaries?
  - This description is crucial.
  - The user can identify good/relevant hits based on the description.
- Two basic kinds:
  - Static
  - Dynamic
- A static summary of a document is always the same, regardless of the query that hit the doc
- A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand
Dynamic summaries
- Present one or more “windows” within the document that contain several of the query terms
  - “KWIC” snippets: Keyword In Context presentation
- Generated in conjunction with scoring
  - If the query is found as a phrase, all or some occurrences of the phrase in the doc
  - If not, document windows that contain multiple query terms
- The summary itself gives the entire content of the window – all terms, not only the query terms – how?
Evaluating search engines
Relevance to what?
- Relevance is assessed relative to the information need, not the query
- E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
  - Query: wine red white heart attack effective
- You evaluate whether the doc addresses the information need, not whether it has these words
- Our terminology is sloppy: we talk about query-document relevance judgments although we mean information-need-document relevance judgments
Unranked retrieval evaluation: Precision and Recall
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
- Precision P = tp / (tp + fp)
- Recall R = tp / (tp + fn)

                  Relevant   Nonrelevant
  Retrieved       tp         fp
  Not Retrieved   fn         tn
Should we instead use the accuracy measure for evaluation?
- Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”
- The accuracy of an engine: the fraction of these classifications that are correct
- Accuracy is a commonly used evaluation measure in machine learning classification work
- Why is this not a very useful evaluation measure in IR?
Why not just use accuracy?
- How to build a 99.9999% accurate search engine on a low budget: return nothing for every query. Since almost all docs are nonrelevant, “Search for: … 0 matching results found.” is almost always a correct classification.
- People doing information retrieval want to find something and have a certain tolerance for junk.
Precision/Recall tradeoff
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
- In a good system, precision decreases as either the number of docs retrieved or recall increases
  - This is not a theorem, but a result with strong empirical confirmation
A combined measure: F
- The combined measure that assesses the precision/recall tradeoff is the F measure, a weighted harmonic mean of P and R:

  1/F = α (1/P) + (1 − α) (1/R),   or equivalently   F = (β² + 1) P R / (β² P + R),   with β² = (1 − α) / α

- People usually use the balanced F measure F1 (i.e. β = 1, or α = 1/2), the harmonic mean of P and R:

  1/F1 = (1/2) (1/P + 1/R),   i.e.   F1 = 2 P R / (P + R)

- β < 1 emphasizes P or R?
- If either P or R is bad, F is bad
Evaluating ranked results
- P/R/F are measured for unranked sets
- We can easily turn set measures into measures for ranked results
  - The system can return any number of results
  - Just use the set measures for each “prefix”: the top 1, top 2, top 3, top 4, etc., results
- Doing this for precision and recall produces a precision-recall curve, where a “prefix” corresponds to a level of recall
A precision-recall curve
(Figure: precision on the y-axis versus recall on the x-axis, both running from 0.0 to 1.0.)
- Sawtooth shape:
  - If the (k+1)th doc is non-relevant, R is the same as for the top k docs, but P has dropped
  - If it is relevant, then both P and R increase, and the curve jags up and to the right
- Often useful to remove the jiggles: interpolation
  - Take the maximum precision of all future points
11-point interpolated average precision
- The entire precision-recall graph is very informative, but there is often a desire to boil this information down to a few numbers, or even a single number
- 11-point interpolated average precision
  - The standard measure in the early TREC competitions
  - Take the (interpolated) precision at 11 levels of recall, varying from 0 to 1 by tenths, and average over queries
- Evaluates performance at all recall levels
Typical (good) 11-point precisions
- SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
(Figure: precision versus recall, both from 0 to 1.)
Mean average precision (MAP)
- Recently, other measures have become more common. Most standard among the TREC community is MAP
  - A single-figure measure of quality across recall levels
  - Good discrimination and stability
- For a single information need, average precision is the average of the precision values obtained for the top k docs each time a relevant doc is retrieved
  - Approximates the area under the un-interpolated precision-recall curve
- Then, this value (average precision) is averaged over many information needs to get MAP
  - Approximates the average area under the precision-recall curve for a set of queries
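A sketch of the computation (this simplified version averages only over the relevant documents that actually appear in the ranking; a full implementation would divide by the total number of relevant documents for the query):

```python
def average_precision(ranked_relevance):
    """Average precision for one query. ranked_relevance is a list of
    booleans in rank order: True where the doc at that rank is relevant."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

print(mean_average_precision([[True, False, True], [False, True]]))   # (0.833 + 0.5) / 2
```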
Yet more evaluation measures…
- The above ones factor in precision at all recall levels
- For many prominent applications, e.g. web search, this may not be appropriate: what matters is rather how many good results there are on the first page or the first 3 pages!
  - Leads to measuring precision at fixed low levels of retrieved results (e.g., 10 or 30)
- Precision at k: precision of the top k results
  - Standard for web search
  - Cons: the least stable among commonly used measures; does not average well, because the total number of relevant docs for a query has a strong influence on precision at k
- R-precision alleviates this problem
  - But may not be feasible for web search
R-precision
- If we have a known (though perhaps incomplete) set of relevant documents of size |Rel|, then calculate the precision of the top |Rel| docs returned
  - Averaging the measure across queries makes more sense
- If there are |Rel| relevant docs for the query, we examine the top |Rel| results, and find that r are relevant. Then:
  - recall = precision = r / |Rel|
  - Thus, R-precision is identical to the break-even point
- Empirically, highly correlated with MAP
Critique of pure relevance
- Assumption: the relevance of one doc is treated as independent of the relevance of other docs in the collection
  - But a document can be redundant (e.g., duplicates) even if it is highly relevant
- Marginal relevance: concerns whether a doc still has distinctive usefulness after the user has looked at certain other documents … (Carbonell and Goldstein 1998)
- Maximizing marginal relevance requires returning documents that exhibit diversity and novelty
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 19: Web search basics
1. Brief history and overview
- Early keyword-based engines
  - Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997
- A hierarchy of categories
  - Yahoo!
  - Many problems, popularity declined. Existing variants are About.com and the Open Directory Project
- Classical IR techniques continue to be necessary for web search, but are by no means sufficient
  - E.g., classical IR measures relevancy; web search needs to measure relevancy + authoritativeness
Web search overview
(Figure: the Web is crawled by a web spider; an indexer builds the indexes, with ad indexes maintained separately; the user’s query is answered from the indexes. The example shows a results page for the query “miele”, about 7,310,000 results in 0.12 seconds, with algorithmic web results on the left and sponsored links on the right.)
Web IR: Differences from traditional IR
- Links: the web is a hyperlinked document collection
- Queries: web queries are different, more varied and there are a lot of them
  - How many? 10^8 every day, approaching 10^9
- Users: users are different, more varied and there are a lot of them
  - How many? 10^9
- Documents: documents are different, more varied and there are a lot of them
  - How many? ~10^11. Indexed: 10^10
- Context: context is more important on the web than in many other IR applications
- Ads and spam
Duplicate documents
- Significant duplication: 30-40% duplicates in some studies
- Duplicates in search results were common in the early days of the Web
- Today’s search engines eliminate duplicates very effectively
  - Key for high user satisfaction
Duplicate detection
- The web is full of duplicated content
- Strict duplicate detection = exact match
  - Not as common
- But many, many cases of near duplicates
  - E.g., last-modified date the only difference between two copies of a page
- Various techniques
  - Fingerprints, shingles, sketches
Size of the web: issues
- How to define size? Number of web servers? Number of pages? Terabytes of data available?
- Some servers are seldom connected
  - example: your laptop running a web server
  - Is it part of the web?
- The “dynamic” web is infinite
  - Any sum of two numbers is its own dynamic page on Google (e.g., “2+4”)
Goal of spamming on the web
- You have a page that will generate lots of revenue for you if people visit it
- Therefore, you’d like to redirect visitors to this page
- One way of doing this: get your page ranked highly in search results
Simplest forms
- First generation engines relied heavily on tf-idf
- Hidden text: dense repetitions of chosen keywords
  - Often, the repetitions would be in the same color as the background of the web page, so that repeated terms got indexed by crawlers but were not visible to humans in browsers
- Keyword stuffing: misleading meta-tags with excessive repetition of chosen keywords
- Used to be effective; most search engines now catch these
- Spammers responded with a richer set of spam techniques
Link spam
- Create lots of links pointing to the page you want to promote
- Put these links on pages with high (at least non-zero) pagerank
  - Newly registered domains (domain flooding)
  - A set of pages pointing to each other to boost each other’s pagerank (mutual admiration society)
  - Pay somebody to put your link on their highly ranked page (the “schuetze horoskop” example)
    - http://www-csli.stanford.edu/~hinrich/horoskop-schuetze.html
  - Leave comments that include the link on blogs
  - Link farms
Search engine optimization
- Promoting a page is not necessarily spam
- It can also be a legitimate business, which is called SEO
  - You can hire an SEO firm to get your page highly ranked
- Motives
  - Commercial, political, religious, lobbies
  - Promotion funded by advertising budget
- Operators
  - Contractors (Search Engine Optimizers) for lobbies, companies
  - Web masters
  - Hosting services
- Forums
  - E.g., Web master world ( www.webmasterworld.com )
3. Advertising as economic model
- Sponsored search ranking: Goto.com (morphed into Overture.com → Yahoo!)
  - Your search ranking depended on how much you paid
  - Auction for keywords: casino was expensive!
  - No separation of ads/docs
- 1998+: Link-based ranking pioneered by Google
  - Blew away all early engines
  - Google added paid-placement “ads” to the side, independent of search results
  - Strict separation of ads and results
First generation of search ads: Goto (1996)
- No separation of ads/docs. Just one result list!
- Buddy Blake bid the maximum ($0.38) for this search
- He paid $0.38 to Goto every time somebody clicked on the link
- Upfront and honest. No relevance ranking, but Goto did not pretend there was any.
(Figure labels: Algorithmic results; Ads.)
The appeal of search ads to advertisers
- Why is web search potentially more attractive for advertisers than TV spots, newspaper ads or radio spots?
- Someone who just searched for “Saturn Aura Sport Sedan” is infinitely more likely to buy one than a random person watching TV.
- Most importantly, the advertiser only pays if the customer took an action indicating interest (i.e., clicking on the ad)
Users of web search
- Use short queries (average < 3 terms)
- Rarely use operators
- Don’t want to spend a lot of time on composing a query
- Only look at the first couple of results
- Want a simple UI, not a search engine start page overloaded with graphics
- Extreme variability in terms of user needs, user expectations, experience, knowledge, …
  - Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class
- One interface for hugely divergent needs
User query needs
- Need [Brod02, RL04]
  - Informational – want to learn about something (~40% / 65%)
    - Not a single page containing the info
  - Navigational – want to go to that page (~25% / 15%)
  - Transactional – want to do something (web-mediated) (~35% / 20%)
    - Access a service
    - Downloads
    - Shop
  - Gray areas
    - Find a good hub
    - Exploratory search: “see what’s there”
- Example queries from the figure: Low hemoglobin; United Airlines; Seattle weather; Mars surface images; Canon S410; Car rental Brasil
Query distribution (1)
Query distribution (2)
- Queries have a power law distribution
- Recall Zipf’s law: a few very frequent words, a large number of very rare words
- Same here: a few very frequent queries, a large number of very rare queries
- Examples of rare queries: searches for names, towns, books, etc.
- The proportion of adult queries is much lower than 1/3
Introduction to Information Retrieval (Manning, Raghavan, Schütze)
Chapter 21: Link analysis
The Web as a Directed Graph
- Assumption 1: a hyperlink is a quality signal
  - A hyperlink between pages denotes author-perceived relevance
- Assumption 2: the anchor text describes the target page
  - we use anchor text somewhat loosely here
  - extended anchor text: a window of text surrounding the anchor text
  - “You can find cheap cars <a href=…>here</a>”
(Figure: Page A links to Page B; the anchor text sits on the hyperlink.)
Google bombs
- Indexing anchor text can have unexpected side effects: Google bombs.
  - What else does not have side effects?
- A Google bomb is a search with “bad” results due to maliciously manipulated anchor text
- Google introduced a new weighting function in January 2007 that fixed many Google bombs
Google bomb example
Cocitation similarity on Google: similar pages
Origins of PageRank: Citation analysis (1)
Origins of PageRank: Citation analysis (2)
Query-independent ordering
- First generation link-based ranking for web search
  - using link counts as simple measures of popularity
  - simple link popularity: number of in-links
- First, retrieve all pages meeting the text query (say venture capital).
- Then, order these by their simple link popularity
- Easy to spam. Why?
Basics for PageRank: random walk
- Imagine a web surfer doing a random walk on web pages:
  - start at a random page
  - at each step, go out of the current page along one of the links on that page, equiprobably
- In the steady state each page has a long-term visit rate – use this as the page’s score
- So, pagerank = steady state probability = long-term visit rate
(Figure: a page with three out-links, each followed with probability 1/3.)
Not quite enough
- The web is full of dead ends
  - a random walk can get stuck in dead ends
  - it makes no sense to talk about long-term visit rates
(Figure: a dead end, a page with no out-links.)
Teleporting
- Teleport operation: the surfer jumps from a node to any other node in the web graph, chosen uniformly at random from all web pages
- Used in two ways:
  - At a dead end, jump to a random web page
  - At any non-dead end, with teleportation probability 0 < α < 1 (say, α = 0.1), jump to a random web page; with the remaining probability 1 − α (0.9), go out on a random link
- Now the surfer cannot get stuck locally
- There is a long-term rate at which any page is visited
  - Not obvious; explained later
- How do we compute this visit rate?
Markov chains
- A Markov chain consists of n states, plus an n×n transition probability matrix P.
- At each step, we are in exactly one of the states.
- For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i.
- Clearly, for each i:

  Σ_{j=1..n} P_ij = 1

- Markov chains are abstractions of random walks
  - State = page
(Figure: states i and j, with P_ij the probability of moving from i to j.)
Exercise
- Represent the teleporting random walk as a Markov chain for the following three-page link structure (pages C, A, B), using the transition probability matrix, with α = 0.3:

  Transition matrix (rows: current state; columns: next state):

    0.1    0.45   0.45
    1/3    1/3    1/3
    0.45   0.45   0.1

  (The row of 1/3's corresponds to a dead-end page, which always teleports; in the other rows, each page gets α/3 = 0.1 from teleporting, and the page's two out-links share the remaining 0.7.)

(Figure: the link structure and the corresponding state diagram.)
Ergodic Markov chains
- A Markov chain is ergodic iff it is irreducible and aperiodic
  - Irreducibility: roughly, there is a path from any state to any other
  - Aperiodicity: roughly, the states cannot be partitioned such that the random walker visits the partitions sequentially
- A non-ergodic Markov chain
(Figure: two states, each moving to the other with probability 1; the chain is periodic, hence non-ergodic.)
Ergodic Markov chains
- Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
  - Steady-state probability distribution.
- Over a long time period, we visit each state in proportion to this rate.
- It doesn’t matter where we start.
Formalization of visit: probability vector
- A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point.
- e.g., (0 0 0 … 1 … 0 0 0) means we’re in state i.
- More generally, the vector x = (x1, …, xn) means the walk is in state i with probability xi, where

  Σ_{i=1..n} x_i = 1
Change in probability vector
- If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
- Recall that row i of the transition probability matrix P tells us where we go next from state i.
- So from x, our next state is distributed as xP
Steady state example
- The steady state is simply a vector of probabilities a = (a1, …, an):
  - ai is the probability that we are in state i
  - ai is the long-term visit rate (or pagerank) of state (page) i
  - so we can think of pagerank as a long vector, one entry for each page
How do we compute this vector?
- Let a = (a1, …, an) denote the row vector of steady-state probabilities.
- If our current position is described by a, then the next step is distributed as aP
- But a is the steady state, so a = aP
- Solving this matrix equation gives us a
  - so a is the (left) eigenvector of P
  - it corresponds to the principal eigenvector of P, the one with the largest eigenvalue
  - transition probability matrices always have largest eigenvalue 1
One way of computing
- Recall: regardless of where we start, we eventually reach the steady state a
- Start with any distribution (say x = (1 0 … 0)).
- After one step, we’re at xP
- After two steps at xP², then xP³ and so on.
- “Eventually” means: for “large” k, xP^k = a
- Algorithm: multiply x by increasing powers of P until the product looks stable
- This is called the power method
Power method: example
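A minimal sketch of the power method with teleporting, which can serve as the example here (the three-page graph at the bottom is made up for illustration; numpy is assumed to be available):

```python
import numpy as np

def pagerank(links, alpha=0.1, tol=1e-10):
    """Power-method PageRank sketch. links[i] lists the pages that page i
    links to; alpha is the teleportation probability."""
    n = len(links)
    P = np.zeros((n, n))
    for i, outs in enumerate(links):
        if outs:
            for j in outs:
                P[i, j] = (1 - alpha) / len(outs)   # follow a random out-link
            P[i, :] += alpha / n                    # ... or teleport
        else:
            P[i, :] = 1.0 / n                       # dead end: always teleport
    x = np.full(n, 1.0 / n)                         # start from any distribution
    while True:
        x_next = x @ P                              # one step of the walk
        if np.abs(x_next - x).sum() < tol:
            return x_next                           # steady-state visit rates
        x = x_next

# Three pages: 0 -> 1, 1 -> 0 and 2, 2 -> 2 (self-link)
print(pagerank([[1], [0, 2], [2]]))
```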
Pagerank summary
- Preprocessing:
  - Given the graph of links, build the transition probability matrix P
  - From it, compute a
  - The entry ai is a number between 0 and 1: the pagerank of page i.
- Query processing:
  - Retrieve pages meeting the query
  - Rank them by their pagerank
  - Order is query-independent
- In practice, pagerank alone wouldn’t work
  - Google paper: http://infolab.stanford.edu/~backrub/google.html
In practice
- Consider the query “video service”
  - Yahoo! has very high pagerank, and contains both words
  - With simple pagerank alone, Yahoo! would be top-ranked
  - Clearly not desirable
- In practice, a composite score is used in ranking
  - Pagerank, cosine similarity, term proximity, etc.
  - May apply machine-learned scoring
  - Many other clever heuristics are used
How important is PageRank?
Pagerank: Issues and Variants
- How realistic is the random surfer model?
  - What if we modeled the back button?
  - Surfer behavior is sharply skewed towards short paths
  - Search engines, bookmarks & directories make jumps non-random.
- Biased surfer models
  - Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection)
  - Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)
- Non-uniform teleportation allows topic-specific pagerank and personalized pagerank
Topic Specific Pagerank
- Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
  - Select a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories
  - Teleport to a page uniformly at random within the chosen category
Pagerank applications beyond web search
- A person is reputable if s/he receives many references from reputable people.
  - How to compute reputation for people?
- Rent a room in an exhibition center: find the one with the highest visit rate.
Hyperlink-Induced Topic Search (HITS)
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
  - Hub pages are good lists of links to pages answering the information need
    - e.g., “Bob’s list of cancer-related links”
  - Authority pages are direct answers to the information need
    - they occur recurrently on good hubs for the subject
- Most approaches to search do not make the distinction between the two sets
Hubs and Authorities
- Thus, a good hub page for a topic points to many authoritative pages for that topic
- A good authority page for a topic is pointed to by many good hubs for that topic
- Circular definition – we will turn this into an iterative computation
Examples of hubs and authorities
(Figure: for the topic “long distance telephone companies”, hubs Alice and Bob point to authorities AT&T, Sprint and MCI.)
High-level scheme
- Do a regular web search first
- Call the search results the root set
- Add in any page that either
  - points to a page in the root set, or
  - is pointed to by a page in the root set
- Call this the base set
- From these, identify a small set of top hub and authority pages
  - Iterative algorithm
Visualization
(Figure: the base set is the root set plus the pages that link to it or are linked from it.)
Assembling the base set
- Root set typically has 200-1000 nodes
- Base set may have up to 5000 nodes
- How do you find the base set nodes?
  - Follow out-links by parsing root set pages
  - Get in-links from a connectivity server, then get those pages
    - This assumes our inverted index supports searches for links, in addition to terms
Distilling hubs and authorities
- Compute, for each page x in the base set, a hub score h(x) and an authority score a(x)
- Initialize: for all x, h(x) ← 1; a(x) ← 1
- Iteratively update all h(x), a(x)
- After convergence:
  - output pages with highest h() scores as top hubs
  - output pages with highest a() scores as top authorities
  - so we output two ranked lists
Iterative update
- Iterate these two steps until convergence:

  for all x:  h(x) ← Σ_{x→y} a(y)      (sum over all pages y that x points to)

  for all x:  a(x) ← Σ_{y→x} h(y)      (sum over all pages y that point to x)
Scaling
- To prevent the h() and a() values from getting too big, we can scale down after each iteration
- The scaling factor doesn’t really matter:
  - we only care about the relative values of the scores
How many iterations?
- Relative values of scores will converge after a few iterations
- In fact, suitably scaled, h() and a() scores settle into a steady state!
  - proof of this comes later
- In practice, ~5 iterations get you close to stability
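A minimal sketch of the iteration on a tiny base set (the graph and names are made up; a real implementation would run on the crawled base set):

```python
def hits(out_links, iterations=5):
    """HITS sketch. out_links[x] lists the pages x points to.
    h(x) = sum of a(y) over pages y that x points to;
    a(x) = sum of h(y) over pages y that point to x; rescale each round."""
    nodes = list(out_links)
    h = {x: 1.0 for x in nodes}
    a = {x: 1.0 for x in nodes}
    in_links = {x: [y for y in nodes if x in out_links[y]] for x in nodes}
    for _ in range(iterations):
        h = {x: sum(a[y] for y in out_links[x]) for x in nodes}
        a = {x: sum(h[y] for y in in_links[x]) for x in nodes}
        # Scale so the largest score is 1; only relative values matter.
        hmax = max(h.values()) or 1
        amax = max(a.values()) or 1
        h = {x: v / hmax for x, v in h.items()}
        a = {x: v / amax for x, v in a.items()}
    return h, a

# One hub page linking to three candidate authorities
print(hits({"hub": ["att", "sprint", "mci"], "att": [], "sprint": [], "mci": []}))
```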
Japan Elementary Schools
(Figure: the top hubs and authorities for this query, mostly Japanese elementary-school home pages and link lists, e.g. “The American School in Japan”, “The Link Page”, “Kids' Space”, “KEIMEI GAKUEN Home Page (Japanese)”, “100 Schools Home Pages (English)”, “K-12 from Japan”; many of the titles are in Japanese.)
Things to note
- Pulled together good pages regardless of the language of the page content.
- Uses only link analysis after the base set is assembled
- Is HITS query-independent?
  - In typical use, no
- Iterative computation after text index retrieval – significant overhead.
PageRank vs. HITS: Discussion
- PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to
- These two choices are orthogonal
  - We could also apply HITS to the entire web and PageRank to a small base set
- On the web, a good hub almost always is also a good authority
- The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect
HITS applications beyond web search
- Researchers publish/present papers in conferences. A conference is reputable if it hosts many reputable researchers who publish/present their papers. A researcher is reputable if s/he publishes/presents many papers in reputable conferences.
- How to compute reputation for conferences? How to compute reputation for researchers?