Transcript of cs707_011712 (8/3/2019)

Prasad, L05 Tolerant IR
Tolerant IR
Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and
Christopher Manning (Stanford)
-
Dictionary data structures for inverted indexes
- The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what data structure?
Sec. 3.1
-
A naive dictionary
- An array of structs:
  char[20]    int        Postings*
  20 bytes    4/8 bytes  4/8 bytes
- How do we store a dictionary in memory efficiently?
- How do we quickly look up elements at query time?
Sec. 3.1
-
Dictionary data structures
- Two main choices: hashtables and trees
- Some IR systems use hashtables, some trees
Sec. 3.1
-
Hashtables
- Each vocabulary term is hashed to an integer
- (We assume you've seen hashtables before)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything
Sec. 3.1
-
Tree: binary tree
[Figure: a binary tree over the lexicon; the root splits a-m / n-z, with children covering a-hu, hy-m, n-sh, si-z]
Sec. 3.1
-
Tree: B-tree
- Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].
[Figure: a B-tree root with children covering a-hu, hy-m, n-z]
Sec. 3.1
-
Trees
- Simplest: binary tree; more usual: B-trees
- Trees require a standard ordering of characters and hence strings, but we typically have one
- Pros:
  - Solves the prefix problem (e.g., terms starting with hyp)
- Cons:
  - Slower: O(log M) [and this requires a balanced tree]
  - Rebalancing binary trees is expensive
    - But B-trees mitigate the rebalancing problem
Sec. 3.1
-
Wild-card queries
-
Wild-card queries: *
- mon*: find all docs containing any word beginning with mon
- Hashing is unsuitable because order is not preserved
- Easy with a binary-tree (or B-tree) lexicon: retrieve all words in the range mon <= w < moo
- *mon: find words ending in mon (harder)
  - Maintain an additional B-tree for terms written backwards; retrieve all words in the range nom <= w < non
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
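The range lookups above can be sketched with a sorted in-memory list standing in for the B-tree lexicon; `bisect` gives the same "retrieve all words in a range" behavior. The vocabulary here is a made-up toy example, not from the slides.

```python
import bisect

# Hypothetical toy vocabulary; a real IR system keeps this in an on-disk B-tree.
lexicon = sorted(["carbon", "demon", "money", "month", "moon", "sermon"])

def prefix_range(terms, prefix):
    """All terms t in the sorted list with prefix <= t, up to the end of the
    prefix's range -- i.e., exactly the terms starting with prefix."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")  # sentinel past any real suffix
    return terms[lo:hi]

# mon*: one range scan on the forward lexicon (the range mon <= w < moo...)
print(prefix_range(lexicon, "mon"))                        # ['money', 'month']

# *mon: range scan on a second lexicon of reversed terms (the range nom <= w < non)
rev_lexicon = sorted(t[::-1] for t in lexicon)
print([t[::-1] for t in prefix_range(rev_lexicon, "nom")])  # ['demon', 'sermon']
```

For the pro*cent exercise, one approach along these lines: intersect the pro* results from the forward lexicon with the *cent results from the reversed lexicon, then post-filter.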
-
Query processing
- At this point, we have an enumeration of all terms in the dictionary that match the wild-card query
- We still have to look up the postings for each enumerated term
- E.g., consider the query: se*ate AND fil*er
  - This may result in the execution of many Boolean AND queries
-
B-trees handle *'s at the end of a query term
- How can we handle *'s in the middle of a query term? (Especially multiple *'s)
- Consider co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets
  - Expensive
- The solution: transform every wild-card query so that the *'s occur at the end
- This gives rise to the Permuterm index
-
Permuterm index
- For the term hello, index it under:
  hello$, ello$h, llo$he, lo$hel, o$hell, where $ is a special end-of-term symbol
- Queries:
  - X: look up X$
  - *X: look up X$*
  - *X*: look up X*
  - X*Y: look up Y$X*
  - X*Y*Z: ???
- Query = hel*o: X = hel, Y = o; look up o$hel*
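A minimal sketch of the rotation trick in Python (the function names are mine, not from the slides): every rotation of term+'$' is indexed, and a query with a single '*' is rotated so the '*' lands at the end.

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each becomes a key in the permuterm index.
    For 'hello' these include hello$, ello$h, llo$he, lo$hel, o$hell (and $hello)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_key(query):
    """Rotate a wildcard query containing a single '*' so the '*' ends up last."""
    assert query.count("*") == 1
    t = query + "$"
    i = t.index("*")
    return t[i + 1:] + t[:i] + "*"

print(permuterm_key("hel*o"))  # o$hel*  (X*Y -> Y$X*)
print(permuterm_key("*mon"))   # mon$*   (*X  -> X$*)
```

The rotated key is then a plain prefix query, handled by a B-tree lookup as on the next slide.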
-
Permuterm query processing
- Rotate the query wild-card to the right
- Now use B-tree lookup as before
- Permuterm problem: it quadruples the lexicon size
  - An empirical observation for English
-
Bigram indexes
- Enumerate all k-grams (sequences of k chars) occurring in any term
- E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
- $ is a special word-boundary symbol
- Maintain a second inverted index from bigrams to dictionary terms that match each bigram
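The k-gram extraction above can be sketched in a few lines (a minimal version; the function name is mine):

```python
def kgrams(term, k=2):
    """k-grams of term with '$' marking the word boundaries (k=2 gives bigrams)."""
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]

# Bigrams of each term in "April is the cruelest month"
for term in "april is the cruelest month".split():
    print(term, kgrams(term))
# e.g. 'april' -> ['$a', 'ap', 'pr', 'ri', 'il', 'l$']
```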
-
Bigram index example
[Figure: postings lists in the bigram index, e.g., $m -> mace, madden; mo -> among, amortize; on -> among, around]
- The k-gram index finds terms based on a query consisting of k-grams (here k = 2).
-
Processing n-gram wild-cards
- The query mon* can now be run as: $m AND mo AND on
- This gets terms that match the AND version of our wildcard query
- But we'd incorrectly enumerate moon as well
- Must post-filter these terms against the query
- Surviving enumerated terms are then looked up in the original term-document inverted index
- Fast, space-efficient (compared to permuterm)
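The enumerate-then-post-filter step can be sketched end to end; the vocabulary here is a hypothetical toy, and `fnmatch` stands in for matching the wildcard pattern.

```python
from collections import defaultdict
import fnmatch

def bigrams(term):
    """Set of bigrams of term, with '$' marking word boundaries."""
    t = "$" + term + "$"
    return {t[i:i + 2] for i in range(len(t) - 1)}

# Hypothetical vocabulary; build the bigram -> terms index
vocab = ["moon", "month", "money", "salmon", "demon"]
index = defaultdict(set)
for term in vocab:
    for g in bigrams(term):
        index[g].add(term)

# Query mon*: intersect the postings for $m, mo, on ...
candidates = index["$m"] & index["mo"] & index["on"]
print(sorted(candidates))   # ['money', 'month', 'moon'] -- 'moon' is a false positive

# ... then post-filter against the original wildcard query
matches = {t for t in candidates if fnmatch.fnmatch(t, "mon*")}
print(sorted(matches))      # ['money', 'month']
```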
-
Processing wild-card queries
- As before, we must execute a Boolean query for each enumerated, filtered term
- Wild-cards can result in expensive query execution (very large disjunctions)
- Avoid encouraging laziness in the UI:
  [Search box] "Type your search terms; use * if you need to. E.g., Alex* will match Alexander."
-
Advanced features
- Avoiding UI clutter is one reason to hide advanced features behind an "Advanced Search" button
- It also deters most users from unnecessarily hitting the engine with fancy queries
-
Spelling correction
-
Spell correction
- Two principal uses:
  - Correcting document(s) being indexed
  - Retrieving matching documents when the query contains a spelling error
- Two main flavors:
  - Isolated word
    - Check each word on its own for misspelling
    - Will not catch typos resulting in correctly spelled words, e.g., form for from
  - Context-sensitive
    - Look at surrounding words, e.g., "I flew form Heathrow to Narita."
-
Document correction
- Especially needed for OCR'ed documents
  - Correction algorithms tuned for this: rn vs m
  - Can use domain-specific knowledge
    - E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing)
- But web pages and even printed material also have typos
- Goal: the dictionary contains fewer misspellings
- But often we don't change the documents; instead we fix the query-document mapping
-
Query mis-spellings
- Our principal focus here
  - E.g., the query Alanis Morisett
- We can either:
  - Retrieve documents indexed by the correct spelling, OR
  - Return several suggested alternative queries with the correct spelling
    - "Did you mean ...?"
-
Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come
- Two basic choices for this:
  - A standard lexicon, such as
    - Webster's English Dictionary
    - An industry-specific, hand-maintained lexicon
  - The lexicon of the indexed corpus
    - E.g., all words on the web
    - All names, acronyms, etc.
    - (Including the mis-spellings)
-
Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- What's "closest"?
- We'll study several alternatives:
  - Edit distance
  - Weighted edit distance
  - n-gram overlap
-
Edit distance
- Given two strings S1 and S2, the minimum number of operations to convert one to the other
- Basic operations are typically character-level:
  - Insert, Delete, Replace, Transposition
- E.g., the edit distance from dof to dog is 1
  - From cat to act it is 2 (just 1 with transpose)
  - From cat to dog it is 3
- Generally found by dynamic programming
- See http://www.merriampark.com/ld.htm for a nice example plus an applet
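The dynamic program mentioned above can be written directly from the recurrence; this minimal sketch uses unit costs and the three basic operations (no transposition):

```python
def edit_distance(s1, s2):
    """Levenshtein distance by dynamic programming: d[i][j] is the distance
    between the first i chars of s1 and the first j chars of s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete everything
    for j in range(n + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace (or match)
    return d[m][n]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "act"))  # 2 (would be 1 with transposition)
print(edit_distance("cat", "dog"))  # 3
```

Adding transposition (Damerau-Levenshtein) or the per-character weights of the next slide only changes the cost terms in the recurrence.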
-
Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
- Meant to capture OCR or keyboard errors, e.g., m is more likely to be mis-typed as n than as q
- Therefore, replacing m by n is a smaller edit distance than replacing it by q
- This may be formulated as a probability model
- Requires a weight matrix as input
- Modify the dynamic programming to handle weights
-
Using edit distances
- Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
- Intersect this set with the list of "correct" words
- Show the terms you found to the user as suggestions
- Alternatively,
  - We can look up all possible corrections in our inverted index and return all docs (slow)
  - We can run with the single most likely correction
- These alternatives disempower the user, but save a round of interaction with the user
-
Edit distance to all dictionary terms?
- Given a (mis-spelled) query, do we compute its edit distance to every dictionary term?
  - Expensive and slow
  - Alternative?
- How do we cut down the set of candidate dictionary terms?
- One possibility is to use n-gram overlap for this
- This can also be used by itself for spelling correction
Sec. 3.3.4
-
n-gram overlap
- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
- Threshold by the number of matching n-grams
  - Variants: weight by keyboard layout, etc.
-
Example with trigrams
- Suppose the text is november
  - Trigrams are nov, ove, vem, emb, mbe, ber
- The query is december
  - Trigrams are dec, ece, cem, emb, mbe, ber
- So 3 trigrams overlap (of 6 in each term)
- How can we turn this into a normalized measure of overlap?
-
One option: the Jaccard coefficient
- A commonly-used measure of overlap
- Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|
- Equals 1 when X and Y have the same elements, and zero when they are disjoint
- X and Y don't have to be the same size
- Always assigns a number between 0 and 1
- Now threshold to decide if you have a match
  - E.g., if J.C. > 0.8, declare a match
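The coefficient and the november/december example above can be checked in a few lines (function names are mine; trigrams here are taken without boundary symbols, as on the previous slide):

```python
def jaccard(x, y):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of two sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def trigrams(term):
    """Trigrams of term, without boundary markers."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

# november vs december: 3 shared trigrams, 6 in each term -> 3 / 9
j = jaccard(trigrams("november"), trigrams("december"))
print(round(j, 3))  # 0.333 -- well below a 0.8 threshold, so no match
```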
-
Matching bigrams
- Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: postings lists lo -> alone, lord, sloth; or -> border, lord, morbid; rd -> ardent, border, card]
- A standard postings merge will enumerate the candidates
-
Matching trigrams
- Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: as before, but with lore in the postings: lo -> alone, lore, sloth; or -> border, lore, morbid; rd -> ardent, border, card]
- A standard postings merge will enumerate the candidates
- Adapt this to use the Jaccard (or another) measure
Sec. 3.3.4
-
Context-sensitive correction
- Need surrounding context to catch this
- NLP is too heavyweight for this
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Now try all possible resulting phrases with one word fixed at a time:
  - flew from heathrow
  - fled form heathrow
  - flea form heathrow
- Hit-based spelling correction: suggest the alternative that has lots of hits
-
General issues in spell correction
- We enumerate multiple alternatives for "Did you mean?"
- Need to figure out which to present to the user
  - The alternative hitting the most docs
  - Query log analysis
- More generally, rank alternatives probabilistically:
  argmax_corr P(corr | query)
- From Bayes' rule, this is equivalent to:
  argmax_corr P(query | corr) * P(corr)
  where P(query | corr) is the noisy channel model and P(corr) is the language model
Sec. 3.3.5
-
Computational cost
- Spell-correction is computationally expensive
- Avoid running it routinely on every query?
- Run it only on queries that matched few docs
-
Thesauri
- Thesaurus: a language-specific list of synonyms for terms likely to be queried
  - car = automobile, etc.
- Machine learning methods can assist
- Can be viewed as a hand-made alternative to edit distance, etc.
-
Query expansion
- Usually do query expansion rather than index expansion
  - No index blowup
  - Query processing is slowed down
    - Docs frequently contain equivalences
  - May retrieve more junk
    - Expanding puma to jaguar retrieves documents on cars instead of on sneakers
-
Soundex
-
Soundex
- A class of heuristics to expand a query into phonetic equivalents
- Language-specific, mainly for names
- E.g., chebyshev = tchebycheff
-
Soundex: typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
3. Change letters to digits as follows:
   - B, F, P, V -> 1
   - C, G, J, K, Q, S, X, Z -> 2
   - D, T -> 3
   - L -> 4
   - M, N -> 5
   - R -> 6
-
Soundex continued
4. Remove all pairs of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <letter><digit><digit><digit>.
- E.g., Herman becomes H655.
- Will hermann generate the same code?
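The six steps above can be sketched directly in Python; this is a minimal version of the slides' algorithm (not any particular database's implementation), and it answers the hermann question.

```python
def soundex(word):
    """Soundex as on the slides: keep the first letter, map letters to digits,
    collapse adjacent identical digits, drop zeros, pad/truncate to 4 chars."""
    codes = {**dict.fromkeys("AEIOUHWY", "0"),
             **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    w = word.upper()
    digits = [codes[c] for c in w if c in codes]
    # step 4: collapse runs of the same digit; step 5: drop the zeros
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    # step 6: keep the first letter, pad with zeros, return four positions
    return (w[0] + "".join(tail) + "000")[:4]

print(soundex("Herman"))   # H655
print(soundex("hermann"))  # H655 -- yes, the double n collapses to the same code
```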
-
Exercise
- Using the algorithm described above, find the soundex code for your name
- Do you know someone who spells their name differently from you, but whose name yields the same soundex code?
-
Soundex
- Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
- How useful is soundex?
  - Not very, for information retrieval
  - Okay for "high recall" tasks (e.g., Interpol), though biased toward names of certain nationalities
- Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR
Sec. 3.4
-
Language detection
- Many of the components described above require language detection
  - For docs/paragraphs at indexing time
  - For query terms at query time (much harder)
- For docs/paragraphs, we generally have enough text to apply machine learning methods
- For queries, we lack sufficient text
  - Augment with other cues, such as client properties/specification from the application
  - Domain of query origination, etc.
-
What queries can we process?
- We have:
  - Basic inverted index with skip pointers
  - Wild-card index
  - Spell-correction
  - Soundex
- Queries such as:
  (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)
-
Aside: results caching
- If 25% of your users are searching for britney AND spears, then you probably do need spelling correction, but you don't need to keep on intersecting those two postings lists
- Web query distribution is extremely skewed, and you can usefully cache results for common queries
- Query log analysis
-
B-Trees
-
Motivation for B-Trees
- Index structures for large datasets cannot be stored in main memory
- Storing them on disk requires a different approach to efficiency
- Assuming a disk spins at 3600 RPM, one revolution occurs in 1/60 of a second, or 16.7 ms
- Crudely speaking, one disk access takes about the same time as 200,000 instructions
-
Motivation (cont.)
- Assume we use an AVL tree to store about 20 million records
- We end up with a very deep binary tree with lots of different disk accesses; log2 20,000,000 is about 24, so this takes about 0.2 seconds
- We know we can't improve on the log n lower bound on search for a binary tree
- But the solution is to use more branches and thus reduce the height of the tree!
  - As branching increases, depth decreases
-
Definition of a B-tree
- A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which:
  1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree
  2. all leaves are on the same level
  3. all non-leaf nodes except the root have at least ceil(m/2) children
  4. the root is either a leaf node, or it has from two to m children
-
An example B-Tree
[Figure: a B-tree of order 5 containing 26 items, with keys 1, 2, 4, 6, 7, 8, 12, 13, 15, 18, 25, 26, 27, 29, 42, 45, 46, 48, 51, 53, 55, 60, 62, 64, 70, 90; the root holds 42, 51, 62]
- Note that all the leaves are at the same level
-
Constructing a B-tree
- Suppose we start with an empty B-tree and keys arrive in the following order: 1 12 8 2 25 5 14 28 17 7 52 16 48 68 3 26 29 53 55 45
- We want to construct a B-tree of order 5
- The first four items go into the root: [1 2 8 12]
- Putting a fifth item in the root would violate the order-5 limit of four keys per node
- Therefore, when 25 arrives, pick the middle key to make a new root
-
Constructing a B-tree (cont'd.)
[Figure: 25 arrives; the middle key 8 is promoted to a new root, with children [1 2] and [12 25]]
- 6, 14, 28 get added to the leaf nodes:
[Figure: root 8, with leaves [1 2 6] and [12 14 25 28]]
-
Constructing a B-tree (cont'd.)
- Adding 17 to the right leaf node would over-fill it, so we take the middle key, promote it (to the root) and split the leaf:
[Figure: root 8 17, with leaves [1 2 6], [12 14], [25 28]]
- 7, 52, 16, 48 get added to the leaf nodes:
[Figure: root 8 17, with leaves [1 2 6 7], [12 14 16], [25 28 48 52]]
-
Constructing a B-tree (cont'd.)
- Adding 68 causes us to split the rightmost leaf, promoting 48 to the root; adding 3 causes us to split the leftmost leaf, promoting 3 to the root; 26, 29, 53, 55 then go into the leaves:
[Figure: root 3 8 17 48, with leaves [1 2], [6 7], [12 14 16], [25 26 28 29], [52 53 55 68]]
- Adding 45 causes a split of [25 26 28 29], and promoting 28 to the root then causes the root to split
-
Constructing a B-tree (cont'd.)
[Figure: the resulting tree. Root: 17; children [3 8] and [28 48]; leaves [1 2], [6 7], [12 14 16], [25 26], [29 45], [52 53 55 68]]
-
Inserting into a B-Tree
- Attempt to insert the new key into a leaf
- If this would result in that leaf becoming too big, split the leaf into two, promoting the middle key to the leaf's parent
- If this would result in the parent becoming too big, split the parent into two, promoting the middle key
- This strategy might have to be repeated all the way to the top
- If necessary, the root is split in two and the middle key is promoted to a new root, making the tree one level higher
-
Exercise in inserting into a B-Tree
- Insert the following keys into a 5-way B-tree:
  3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
-
Removal from a B-tree
- During insertion, the key always goes into a leaf. For deletion we wish to remove from a leaf. There are three possible ways we can do this:
- 1: If the key is already in a leaf node, and removing it doesn't cause that leaf node to have too few keys, then simply remove the key to be deleted.
- 2: If the key is not in a leaf, then it is guaranteed (by the nature of a B-tree) that its predecessor or successor will be in a leaf; in this case we can delete the key and promote the predecessor or successor key to the non-leaf deleted key's position.
-
Removal from a B-tree (2)
- If (1) or (2) lead to a leaf node containing less than the minimum number of keys, then we have to look at the siblings immediately adjacent to the leaf in question:
  - 3: if one of them has more than the minimum number of keys, then we can promote one of its keys to the parent and take the parent key into our lacking leaf
  - 4: if neither of them has more than the minimum number of keys, then the lacking leaf and one of its neighbours can be combined with their shared parent (the opposite of promoting a key), and the new leaf will have the correct number of keys; if this step leaves the parent with too few keys, the process is repeated up the tree
-
Type #1: Simple leaf deletion
[Figure: a 5-way B-tree, as before; root 12 29 52, leaves [2 7 9], [15 22], [31 43], [56 69 72]]
- Delete 2: since there are enough keys in the node, just delete it
(Note when printed: this slide is animated)
-
Type #2: Simple non-leaf deletion
[Figure: root 12 29 52, leaves [7 9], [15 22], [31 43], [56 69 72]]
- Delete 52: borrow the predecessor or (in this case) the successor, 56
-
Type #4: Too few keys in node and its siblings
[Figure: root 12 29 56, leaves [7 9], [15 22], [31 43], [69 72]]
- Delete 72: too few keys! Join the node back together with its sibling
-
Type #4: Too few keys in node and its siblings (result)
[Figure: root 12 29, leaves [7 9], [15 22], [31 43 56 69]]
-
Type #3: Enough siblings
[Figure: root 12 29, leaves [7 9], [15 22], [31 43 56 69]]
- Delete 22: demote a root key and promote a leaf key
-
Type #3: Enough siblings (result)
[Figure: root 12 31, leaves [7 9], [15 29], [43 56 69]]
-
Exercise in removal from a B-Tree
- Given the 5-way B-tree created from these data (last exercise):
  3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
- Add these further keys: 2, 6, 12
- Delete these keys: 4, 5, 7, 3, 14
-
Analysis of B-Trees
- The maximum number of items in a B-tree of order m and height h:
  root:    m - 1
  level 1: m(m - 1)
  level 2: m^2(m - 1)
  ...
  level h: m^h(m - 1)
- So the total number of items is
  (1 + m + m^2 + m^3 + ... + m^h)(m - 1) = [(m^(h+1) - 1) / (m - 1)](m - 1) = m^(h+1) - 1
- When m = 5 and h = 2 this gives 5^3 - 1 = 124
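The geometric-series result above is easy to check numerically (a small sketch; the function name is mine):

```python
def btree_max_items(m, h):
    """Maximum number of items in a B-tree of order m and height h: m^(h+1) - 1."""
    return m ** (h + 1) - 1

print(btree_max_items(5, 2))    # 124, matching the slide
print(btree_max_items(101, 3))  # 101**4 - 1, roughly 100 million
```

The second call previews the next slide's point: a wide, shallow tree holds an enormous number of items in very few levels.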
-
Reasons for using B-Trees
- When searching tables held on disc, the cost of each disc transfer is high but doesn't depend much on the amount of data transferred, especially if consecutive items are transferred
- If we use a B-tree of order 101, say, we can transfer each node in one disc-read operation
- A B-tree of order 101 and height 3 can hold 101^4 - 1 items (approximately 100 million), and any item can be accessed with 3 disc reads (assuming we hold the root in memory)
- If we take m = 3, we get a 2-3 tree, in which non-leaf nodes have two or three children (i.e., one or two keys)
- B-Trees are always balanced, since the leaves are all at the same level
-
Comparing Trees
- Binary trees
  - Can become unbalanced and lose their good time complexity (big O)
  - AVL trees are strict binary trees that overcome the balance problem
  - Heaps remain balanced, but only prioritise (not order) the keys
- Multi-way trees
  - B-Trees can be m-way; they can have any (odd) number of children
  - One B-Tree: the 2-3 (or 3-way) B-Tree