Preparing the Text for Indexing
1. Collecting the Documents

[Indexing pipeline diagram: documents to be indexed ("Friends, Romans, countrymen.") → Tokenizer → token stream (Friends Romans Countrymen) → Linguistic modules → modified tokens (friend roman countryman) → Indexer → inverted index (friend → 2, 4; roman → 1, 2; countryman → 13, 16).]
What exactly is "a document"?
What format is it in? PDF, Word, Excel, HTML?
What language is it in? What character set is in use? How easy is it to "read it in"?
Each of these is a classification problem that can be tackled with machine learning.
But these tasks are often done heuristically…
Complications: Format/language
What is a unit document? A file? A book? A chapter of a book? An email? An email with 5 attachments? A group of files (PPT or LaTeX in HTML)?
Documents being indexed can come from many different languages, so an index may have to contain terms of several languages.
Sometimes a document or its components can contain multiple languages/formats: e.g., a French email with a German PDF attachment.
Getting the Tokens
Words, Terms, Tokens, Types
Word – a delimited string of characters as it appears in the text.
Term – a "normalized" word (case, morphology, spelling, etc.); an equivalence class of words.
Token – an instance of a word or term occurring in a document.
Type – the same as a term in most cases; an equivalence class of tokens.
Tokenization
Input: "Friends, Romans and Countrymen"
Output: tokens
Friends Romans Countrymen
Each such token is now a candidate for an index entry, after further processing.
But what are valid tokens to emit?
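As a minimal sketch (not code from the deck), a naive tokenizer can simply split on runs of word characters; the names below are illustrative, and real tokenizers must handle the hyphen, apostrophe, and number cases discussed on the following slides.

```python
import re

def tokenize(text):
    """Split text into tokens by extracting runs of letters and digits.
    Deliberately naive: punctuation is dropped, so "Finland's" becomes
    two tokens and hyphenated sequences are split."""
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
```

Note that even this toy version forces a policy decision: by dropping punctuation it implicitly answers the "valid token" questions on the next slide one particular way.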
What is a valid token?
Finland's capital → Finland? Finlands? Finland's?
Hewlett-Packard → Hewlett and Packard as two tokens?
State-of-the-art: break up the hyphenated sequence?
co-education?
San Francisco: one token or two? San Francisco-Los Angeles highway: how many tokens? Dr. Summer's address is 35 Winter St., 23014-1234, RI, USA.
Tokenization: Numbers
3/12/91 vs. Mar. 12, 1991 vs. 3 Dec. 1991; 52 B.C.; B-52; My PGP key is 324a3df234cb23e; 100.2.86.144
Often, don't index numbers as text. But they are often very useful: e.g., for looking up error codes on the web.
Often, we index "meta-data" separately: creation date, format, etc.
Tokenization: language issues
East Asian languages (e.g., Chinese and Japanese) have no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。 ('Sharapova now lives in Florida, in the southeastern United States.') A unique tokenization is not always guaranteed.
Semitic languages (Arabic, Hebrew) are written right to left, but certain items (e.g., numbers) are written left to right. Words are separated, but letter forms within a word form complex ligatures:
استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي.
'Algeria achieved its independence in 1962 after 132 years of French occupation.'
Making sense of the Tokens
Linguistic Processing
Normalization
Capitalization/case-folding
Stop words
Stemming
Lemmatization
Linguistic Processing: Normalization
We need to "normalize" terms in the indexed text and in query terms into the same form: we want to match U.S.A. and USA.
Most commonly, we define equivalence classes of terms, e.g., by deleting periods in a term.
The alternative is to do asymmetric expansion:
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows
Enter: Windows → Search: Windows
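The equivalence-classing approach above can be sketched in a few lines; this is an illustrative stand-in, not the deck's code, and it combines period deletion with the lowercasing discussed two slides later.

```python
def normalize(term):
    """Map a raw token to its equivalence-class representative by
    deleting periods and lowercasing. All members of one class
    (U.S.A., USA, usa) collapse to the same string."""
    return term.replace(".", "").lower()

print(normalize("U.S.A."))  # usa
print(normalize("USA"))     # usa
```

Because the same function is applied at both indexing time and query time, a query for either spelling retrieves documents containing the other.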
Normalization: other languages
Accents: résumé vs. resume. The most important criterion: how are your users likely to write their queries for these words?
Even in languages that standardly have accents, users often may not type them.
German: Tuebingen vs. Tübingen should be equivalent.
Linguistic Processing: Case folding
Reduce all letters to lower case. Possible exception: preserve upper case occurring in mid-sentence?
e.g., General Motors; Fed vs. fed; SAIL vs. sail
Often best to lower-case everything, since users will use lowercase regardless of 'correct' capitalization…
Linguistic Processing: Stop Words
With a stop list, you exclude the commonest words from the dictionary entirely. Why do it:
They have little semantic content: the, a, and, to, be. They take a lot of space: ~30% of postings for the top 30 words.
You will measure this!
But the trend is away from doing this. You need stop words for:
Phrase queries: "King of Denmark"
Various song titles, etc.: "Let it be", "To be or not to be"
"Relational" queries: "flights to London"
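Stop-list filtering is a one-line operation; the sketch below uses a small illustrative stop set of my own choosing (real lists are longer and corpus-dependent), and it also shows exactly what a phrase query like "flights to London" loses.

```python
# A tiny illustrative stop list; real systems derive theirs
# from collection frequency statistics.
STOP_WORDS = {"the", "a", "an", "and", "to", "be", "of", "in"}

def remove_stop_words(tokens):
    """Drop any token whose lowercased form is on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["flights", "to", "London"]))
# ['flights', 'London'] -- the relational sense of "to" is lost
```

This is precisely why the slide notes the trend away from stop-word removal: the discarded tokens carry the structure that phrase and "relational" queries depend on.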
Linguistic Processing: Stemming
Reduce terms to their "roots" before indexing.
"Stemming" suggests crude affix chopping; it is language dependent. e.g., automate(s), automatic, automation are all reduced to automat.
Porter's algorithm: the commonest algorithm for stemming English. Results suggest it is at least as good as other stemming options. You can find the algorithm and several implementations at http://tartarus.org/~martin/PorterStemmer/
Typical rules in Porter

Rule           | Example
sses → ss      | caresses → caress
ies → i        | butterflies → butterfli
ing → ε        | meeting → meet
tional → tion  | intentional → intention

Weight-of-word-sensitive rules: (m>1) EMENT → ε
replacement → replac, but cement → cement
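The four suffix rules in the table can be sketched as a toy stemmer. This is only a fragment of Porter's algorithm: the real algorithm applies rules in ordered steps and guards many of them with measure conditions like the (m>1) EMENT rule above.

```python
# Suffix-rewrite rules from the table, tried in order; the first
# matching suffix wins.
RULES = [("sses", "ss"), ("ies", "i"), ("tional", "tion"), ("ing", "")]

def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(toy_stem("caresses"))     # caress
print(toy_stem("butterflies"))  # butterfli
print(toy_stem("meeting"))      # meet
print(toy_stem("intentional"))  # intention
```

Note that rule order matters: "tional" must be tested before a bare "ing"-style rule could ever misfire, which is why Porter organizes his rules into sequential steps.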
An Example of Stemming
Original: After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance.
Stemmed: after introduc a gener search engin architectur, we examin each engin compon in turn. we cover crawl, local web page storag, index, and the us of link analysi for boost search perform.
Linguistic Processing: Lemmatization
Reduce inflectional/variant forms to the base form. E.g.,
am, are, is → be
car, cars, car's, cars' → car
Lemmatization implies doing "proper" reduction to the dictionary headword form:
the boy's cars are different colors → the boy car be different color
Phrase queries
We want to answer queries such as "stanford university" – as a phrase.
Thus the sentence "I went to university at Stanford" is not a match.
The concept of phrase queries has proven easily understood by users – they use quotes.
About 10% of web queries are phrase queries. On average, a query is 2.3 words long. (Is that still the case?)
It no longer suffices to store only <term : docs> entries.
Phrase queries via Biword Indices
Biwords: index every consecutive pair of terms in the text as a phrase.
For example, the text "Friends, Romans, Countrymen" would generate the biwords:
friends romans
romans countrymen
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate (it works exactly like the one-term process).
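Building a biword index is a small extension of an ordinary inverted index; the sketch below (illustrative names, toy documents) treats each consecutive pair as a single dictionary term, so a two-word phrase query is just one dictionary lookup.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Index every consecutive pair of terms as one dictionary term.
    `docs` maps doc IDs to already-tokenized, normalized term lists."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for first, second in zip(terms, terms[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

docs = {1: ["friends", "romans", "countrymen"],
        2: ["romans", "countrymen", "lend", "me", "your", "ears"]}
index = build_biword_index(docs)
print(index["friends romans"])     # {1}
print(index["romans countrymen"])  # {1, 2}
```

A two-word phrase query such as "romans countrymen" is answered exactly like a one-term query: fetch the postings for that single biword term.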
Longer phrase queries
stanford university palo alto can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto
Without examining each of the returned docs, we cannot verify that the docs matching the above Boolean query actually contain the phrase.
We can have false positives! (Why?)
Issues for biword indexes
False positives, as noted before. Index blowup due to a bigger dictionary.
For an extended biword index, longer queries must be parsed into conjunctions:
E.g., the query tangerine trees, marmalade skies is parsed into tangerine trees AND trees marmalade AND marmalade skies.
Not a standard solution (for all biwords).
Better solution: Positional indexes
Store, for each term, entries of the form:
<number of docs containing term;
doc1: position1, position2 …;
doc2: position1, position2 …;
etc.>
Example:
<be: 993427;
1: 6 {7, 18, 33, 72, 86, 231};
2: 2 {3, 149};
4: 5 {17, 191, 291, 430, 434};
5: 2 {363, 367}; …>
Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Processing a phrase query
Merge the doc:position lists of the query terms to enumerate all positions of "to be or not to be".
to, 993427:
2: 5 {1, 17, 74, 222, 551}; 4: 5 {8, 16, 190, 429, 433}; 7: 3 {13, 23, 191}; …
be, 178239:
1: 2 {17, 19}; 4: 5 {17, 191, 291, 430, 434}; 5: 3 {14, 19, 101}; …
The same general method works for proximity searches.
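The positional merge above can be sketched for the two-term case. The function and data-structure names are illustrative; the postings data is taken from the slide, where doc 4 contains "to" at position 16 and "be" at the adjacent position 17.

```python
def phrase_match(pos_index, first, second):
    """Find docs where `second` occurs at the position immediately
    after `first`, given a positional index of the form
    term -> {doc_id: sorted list of positions}."""
    matches = set()
    first_postings = pos_index.get(first, {})
    second_postings = pos_index.get(second, {})
    # Only docs containing both terms can match the phrase.
    for doc_id in first_postings.keys() & second_postings.keys():
        second_positions = set(second_postings[doc_id])
        if any(p + 1 in second_positions for p in first_postings[doc_id]):
            matches.add(doc_id)
    return matches

pos_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_match(pos_index, "to", "be"))  # {4}
```

Proximity search uses the same merge with `p + 1` relaxed to a window, e.g. `abs(p - q) <= k`.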
Extra Slides
Combination schemes
A positional index expands postings storage substantially. (Why?)
The biword index and positional index approaches can be profitably combined.
For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists.
Even more so for phrases like "The Who". But merging is useful for "miserable failure".
Some Statistics
Query             | 2013 results  | 2013 time | 2011 results | 2011 time
britney spears    | 409,000,000   | 0.24      | 99,000,000   | 0.09
emmy noether      | 930,000       | 0.26      | 260,000      | 0.59
the who           | 9,440,000,000 | 0.27      | 848,000,000  | 0.09
wellesley college | 4,330,000     | 0.21      | 979,000      | 0.07
worcester college | 4,750,000     | 0.48      | 473,000      | 0.55
fast cars         | 181,000,000   | 0.27      | 24,300,000   | 0.11
slow cars         | 158,000,000   | 0.36      | 553,000      | 0.23
sfjvrbfb wrla     | 1             | 0.29      |              |
IR: The not-so-exciting part
Recap: The Inverted Index
For each term t, we store a sorted list of all documents that contain t.
[Figure: dictionary entries on the left, each pointing to its postings list on the right.]
Intersecting two posting lists
[Figure: the merge algorithm walking two sorted postings lists in parallel.]
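The merge the figure illustrates can be sketched as the classic two-pointer intersection: because both postings lists are sorted by doc ID, we advance the pointer sitting on the smaller ID, so the whole intersection costs O(len(p1) + len(p2)).

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists of doc IDs by walking
    both in parallel, advancing the pointer with the smaller ID."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # [2, 31]
```

This linear merge is why postings lists are kept sorted by doc ID: an unsorted representation would force either sorting at query time or a quadratic pairwise scan.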
Does Google use the Boolean model?
The default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn.
You also get hits that do not contain one of the wi:
anchor text
the page contains a variant of wi (morphology, spelling correction, synonym)
long queries (n > 10?)
when the Boolean expression generates very few hits
Boolean vs. ranking of the result set: most well-designed Boolean engines rank the Boolean result set; they rank "good hits" higher than "bad hits" (according to some estimator of relevance).