Transcript of cs707_011712 (8/3/2019)

Prasad, L05 Tolerant IR
Tolerant IR
Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and
Christopher Manning (Stanford)
-
Dictionary data structures for inverted indexes
- The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what data structure?
Sec. 3.1
-
A naive dictionary
- An array of structs:
  char[20]    int        Postings*
  20 bytes    4/8 bytes  4/8 bytes
- How do we store a dictionary in memory efficiently?
- How do we quickly look up elements at query time?
Sec. 3.1
-
Dictionary data structures
- Two main choices: hashtables and trees
- Some IR systems use hashtables, some trees
Sec. 3.1
-
Hashtables
- Each vocabulary term is hashed to an integer
- (We assume you've seen hashtables before)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything
Sec. 3.1
-
Tree: binary tree
[Figure: a binary tree over the lexicon; the root splits a-m / n-z, with children covering a-hu, hy-m, n-sh, si-z]
Sec. 3.1
-
Tree: B-tree
- Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].
[Figure: a B-tree root with children covering a-hu, hy-m, n-z]
Sec. 3.1
-
Trees
- Simplest: binary tree; more usual: B-trees
- Trees require a standard ordering of characters and hence strings, but we typically have one
- Pros:
  - Solves the prefix problem (e.g., terms starting with hyp)
- Cons:
  - Slower: O(log M) [and this requires a balanced tree]
  - Rebalancing binary trees is expensive
    - But B-trees mitigate the rebalancing problem
Sec. 3.1
-
Wild-card queries
-
Wild-card queries: *
- mon*: find all docs containing any word beginning with mon
- Hashing is unsuitable because order is not preserved
- Easy with a binary-tree (or B-tree) lexicon: retrieve all words in the range mon <= w < moo
- *mon: find words ending in mon (harder)
  - Maintain an additional B-tree for terms written backwards; retrieve all words in the range nom <= w < non
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
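The range lookups above can be sketched with a sorted in-memory list standing in for the B-tree lexicon; `bisect` gives the same "retrieve all words in a range" behavior. The vocabulary here is a made-up toy example, not from the slides.

```python
import bisect

# Hypothetical toy vocabulary; a real IR system keeps this in an on-disk B-tree.
lexicon = sorted(["carbon", "demon", "money", "month", "moon", "sermon"])

def prefix_range(terms, prefix):
    """All terms t in the sorted list with prefix <= t, up to the end of the
    prefix's range -- i.e., exactly the terms starting with prefix."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")  # sentinel past any real suffix
    return terms[lo:hi]

# mon*: one range scan on the forward lexicon (the range mon <= w < moo...)
print(prefix_range(lexicon, "mon"))                        # ['money', 'month']

# *mon: range scan on a second lexicon of reversed terms (the range nom <= w < non)
rev_lexicon = sorted(t[::-1] for t in lexicon)
print([t[::-1] for t in prefix_range(rev_lexicon, "nom")])  # ['demon', 'sermon']
```

For the pro*cent exercise, one approach along these lines: intersect the pro* results from the forward lexicon with the *cent results from the reversed lexicon, then post-filter.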
-
Query processing
- At this point, we have an enumeration of all terms in the dictionary that match the wild-card query
- We still have to look up the postings for each enumerated term
- E.g., consider the query: se*ate AND fil*er
  - This may result in the execution of many Boolean AND queries
-
B-trees handle *'s at the end of a query term
- How can we handle *'s in the middle of a query term? (Especially multiple *'s)
- Consider co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets
  - Expensive
- The solution: transform every wild-card query so that the *'s occur at the end
- This gives rise to the Permuterm index
-
Permuterm index
- For the term hello, index it under:
  hello$, ello$h, llo$he, lo$hel, o$hell, where $ is a special end-of-term symbol
- Queries:
  - X: look up X$
  - *X: look up X$*
  - *X*: look up X*
  - X*Y: look up Y$X*
  - X*Y*Z: ???
- Query = hel*o: X = hel, Y = o; look up o$hel*
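A minimal sketch of the rotation trick in Python (the function names are mine, not from the slides): every rotation of term+'$' is indexed, and a query with a single '*' is rotated so the '*' lands at the end.

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each becomes a key in the permuterm index.
    For 'hello' these include hello$, ello$h, llo$he, lo$hel, o$hell (and $hello)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_key(query):
    """Rotate a wildcard query containing a single '*' so the '*' ends up last."""
    assert query.count("*") == 1
    t = query + "$"
    i = t.index("*")
    return t[i + 1:] + t[:i] + "*"

print(permuterm_key("hel*o"))  # o$hel*  (X*Y -> Y$X*)
print(permuterm_key("*mon"))   # mon$*   (*X  -> X$*)
```

The rotated key is then a plain prefix query, handled by a B-tree lookup as on the next slide.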
-
Permuterm query processing
- Rotate the query wild-card to the right
- Now use B-tree lookup as before
- Permuterm problem: it quadruples the lexicon size
  - An empirical observation for English
-
Bigram indexes
- Enumerate all k-grams (sequences of k chars) occurring in any term
- E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
- $ is a special word-boundary symbol
- Maintain a second inverted index from bigrams to dictionary terms that match each bigram
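The k-gram extraction above can be sketched in a few lines (a minimal version; the function name is mine):

```python
def kgrams(term, k=2):
    """k-grams of term with '$' marking the word boundaries (k=2 gives bigrams)."""
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]

# Bigrams of each term in "April is the cruelest month"
for term in "april is the cruelest month".split():
    print(term, kgrams(term))
# e.g. 'april' -> ['$a', 'ap', 'pr', 'ri', 'il', 'l$']
```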
-
Bigram index example
[Figure: postings lists in the bigram index, e.g., $m -> mace, madden; mo -> among, amortize; on -> among, around]
- The k-gram index finds terms based on a query consisting of k-grams (here k = 2).
-
Processing n-gram wild-cards
- The query mon* can now be run as: $m AND mo AND on
- This gets terms that match the AND version of our wildcard query
- But we'd incorrectly enumerate moon as well
- Must post-filter these terms against the query
- Surviving enumerated terms are then looked up in the original term-document inverted index
- Fast, space-efficient (compared to permuterm)
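The enumerate-then-post-filter step can be sketched end to end; the vocabulary here is a hypothetical toy, and `fnmatch` stands in for matching the wildcard pattern.

```python
from collections import defaultdict
import fnmatch

def bigrams(term):
    """Set of bigrams of term, with '$' marking word boundaries."""
    t = "$" + term + "$"
    return {t[i:i + 2] for i in range(len(t) - 1)}

# Hypothetical vocabulary; build the bigram -> terms index
vocab = ["moon", "month", "money", "salmon", "demon"]
index = defaultdict(set)
for term in vocab:
    for g in bigrams(term):
        index[g].add(term)

# Query mon*: intersect the postings for $m, mo, on ...
candidates = index["$m"] & index["mo"] & index["on"]
print(sorted(candidates))   # ['money', 'month', 'moon'] -- 'moon' is a false positive

# ... then post-filter against the original wildcard query
matches = {t for t in candidates if fnmatch.fnmatch(t, "mon*")}
print(sorted(matches))      # ['money', 'month']
```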
-
Processing wild-card queries
- As before, we must execute a Boolean query for each enumerated, filtered term
- Wild-cards can result in expensive query execution (very large disjunctions)
- Avoid encouraging laziness in the UI:
  [Search box] "Type your search terms; use * if you need to. E.g., Alex* will match Alexander."
-
Advanced features
- Avoiding UI clutter is one reason to hide advanced features behind an "Advanced Search" button
- It also deters most users from unnecessarily hitting the engine with fancy queries
-
Spelling correction
-
Spell correction
- Two principal uses:
  - Correcting document(s) being indexed
  - Retrieving matching documents when the query contains a spelling error
- Two main flavors:
  - Isolated word
    - Check each word on its own for misspelling
    - Will not catch typos resulting in correctly spelled words, e.g., form for from
  - Context-sensitive
    - Look at surrounding words, e.g., "I flew form Heathrow to Narita."
-
Document correction
- Especially needed for OCR'ed documents
  - Correction algorithms tuned for this: rn vs m
  - Can use domain-specific knowledge
    - E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing)
- But web pages and even printed material also have typos
- Goal: the dictionary contains fewer misspellings
- But often we don't change the documents; instead we fix the query-document mapping
-
Query mis-spellings
- Our principal focus here
  - E.g., the query Alanis Morisett
- We can either:
  - Retrieve documents indexed by the correct spelling, OR
  - Return several suggested alternative queries with the correct spelling
    - "Did you mean ...?"
-
Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come
- Two basic choices for this:
  - A standard lexicon, such as
    - Webster's English Dictionary
    - An industry-specific, hand-maintained lexicon
  - The lexicon of the indexed corpus
    - E.g., all words on the web
    - All names, acronyms, etc.
    - (Including the mis-spellings)
-
Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- What's "closest"?
- We'll study several alternatives:
  - Edit distance
  - Weighted edit distance
  - n-gram overlap
-
Edit distance
- Given two strings S1 and S2, the minimum number of operations to convert one to the other
- Basic operations are typically character-level:
  - Insert, Delete, Replace, Transposition
- E.g., the edit distance from dof to dog is 1
  - From cat to act it is 2 (just 1 with transpose)
  - From cat to dog it is 3
- Generally found by dynamic programming
- See http://www.merriampark.com/ld.htm for a nice example plus an applet
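The dynamic program mentioned above can be written directly from the recurrence; this minimal sketch uses unit costs and the three basic operations (no transposition):

```python
def edit_distance(s1, s2):
    """Levenshtein distance by dynamic programming: d[i][j] is the distance
    between the first i chars of s1 and the first j chars of s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete everything
    for j in range(n + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace (or match)
    return d[m][n]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "act"))  # 2 (would be 1 with transposition)
print(edit_distance("cat", "dog"))  # 3
```

Adding transposition (Damerau-Levenshtein) or the per-character weights of the next slide only changes the cost terms in the recurrence.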
-
Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
- Meant to capture OCR or keyboard errors, e.g., m is more likely to be mis-typed as n than as q
- Therefore, replacing m by n is a smaller edit distance than replacing it by q
- This may be formulated as a probability model
- Requires a weight matrix as input
- Modify the dynamic programming to handle weights
-
Using edit distances
- Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
- Intersect this set with the list of "correct" words
- Show the terms you found to the user as suggestions
- Alternatively,
  - We can look up all possible corrections in our inverted index and return all docs (slow)
  - We can run with the single most likely correction
- These alternatives disempower the user, but save a round of interaction with the user
-
Edit distance to all dictionary terms?
- Given a (mis-spelled) query, do we compute its edit distance to every dictionary term?
  - Expensive and slow
  - Alternative?
- How do we cut down the set of candidate dictionary terms?
- One possibility is to use n-gram overlap for this
- This can also be used by itself for spelling correction
Sec. 3.3.4
-
n-gram overlap
- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
- Threshold by the number of matching n-grams
  - Variants: weight by keyboard layout, etc.
-
Example with trigrams
- Suppose the text is november
  - Trigrams are nov, ove, vem, emb, mbe, ber
- The query is december
  - Trigrams are dec, ece, cem, emb, mbe, ber
- So 3 trigrams overlap (of 6 in each term)
- How can we turn this into a normalized measure of overlap?
-
One option: the Jaccard coefficient
- A commonly-used measure of overlap
- Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|
- Equals 1 when X and Y have the same elements, and zero when they are disjoint
- X and Y don't have to be the same size
- Always assigns a number between 0 and 1
- Now threshold to decide if you have a match
  - E.g., if J.C. > 0.8, declare a match
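The coefficient and the november/december example above can be checked in a few lines (function names are mine; trigrams here are taken without boundary symbols, as on the previous slide):

```python
def jaccard(x, y):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of two sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def trigrams(term):
    """Trigrams of term, without boundary markers."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

# november vs december: 3 shared trigrams, 6 in each term -> 3 / 9
j = jaccard(trigrams("november"), trigrams("december"))
print(round(j, 3))  # 0.333 -- well below a 0.8 threshold, so no match
```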
-
Matching bigrams
- Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: postings lists lo -> alone, lord, sloth; or -> border, lord, morbid; rd -> ardent, border, card]
- A standard postings merge will enumerate the candidates
-
Matching trigrams
- Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: as before, but with lore in the postings: lo -> alone, lore, sloth; or -> border, lore, morbid; rd -> ardent, border, card]
- A standard postings merge will enumerate the candidates
- Adapt this to use the Jaccard (or another) measure
Sec. 3.3.4
-
Context-sensitive correction
- Need surrounding context to catch this
- NLP is too heavyweight for this
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Now try all possible resulting phrases with one word fixed at a time:
  - flew from heathrow
  - fled form heathrow
  - flea form heathrow
- Hit-based spelling correction: suggest the alternative that has lots of hits
-
General issues in spell correction
- We enumerate multiple alternatives for "Did you mean?"
- Need to figure out which to present to the user
  - The alternative hitting the most docs
  - Query log analysis
- More generally, rank alternatives probabilistically:
  argmax_corr P(corr | query)
- From Bayes' rule, this is equivalent to:
  argmax_corr P(query | corr) * P(corr)
  where P(query | corr) is the noisy channel model and P(corr) is the language model
Sec. 3.3.5
-
Computational cost
- Spell-correction is computationally expensive
- Avoid running it routinely on every query?
- Run it only on queries that matched few docs
-
Thesauri
- Thesaurus: a language-specific list of synonyms for terms likely to be queried
  - car = automobile, etc.
- Machine learning methods can assist
- Can be viewed as a hand-made alternative to edit distance, etc.
-
Query expansion
- Usually do query expansion rather than index expansion
  - No index blowup
  - Query processing is slowed down
    - Docs frequently contain equivalences
  - May retrieve more junk
    - Expanding puma to jaguar retrieves documents on cars instead of on sneakers
-
Soundex
-
Soundex
- A class of heuristics to expand a query into phonetic equivalents
- Language-specific, mainly for names
- E.g., chebyshev = tchebycheff
-
Soundex: typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
3. Change letters to digits as follows:
   - B, F, P, V -> 1
   - C, G, J, K, Q, S, X, Z -> 2
   - D, T -> 3
   - L -> 4
   - M, N -> 5
   - R -> 6
-
Soundex continued
4. Remove all pairs of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <letter><digit><digit><digit>.
- E.g., Herman becomes H655.
- Will hermann generate the same code?
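The six steps above can be sketched directly in Python; this is a minimal version of the slides' algorithm (not any particular database's implementation), and it answers the hermann question.

```python
def soundex(word):
    """Soundex as on the slides: keep the first letter, map letters to digits,
    collapse adjacent identical digits, drop zeros, pad/truncate to 4 chars."""
    codes = {**dict.fromkeys("AEIOUHWY", "0"),
             **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    w = word.upper()
    digits = [codes[c] for c in w if c in codes]
    # step 4: collapse runs of the same digit; step 5: drop the zeros
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    # step 6: keep the first letter, pad with zeros, return four positions
    return (w[0] + "".join(tail) + "000")[:4]

print(soundex("Herman"))   # H655
print(soundex("hermann"))  # H655 -- yes, the double n collapses to the same code
```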
-
Exercise
- Using the algorithm described above, find the soundex code for your name
- Do you know someone who spells their name differently from you, but whose name yields the same soundex code?
-
Soundex
- Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
- How useful is soundex?
  - Not very, for information retrieval
  - Okay for "high recall" tasks (e.g., Interpol), though biased toward names of certain nationalities
- Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR
Sec. 3.4
-
Language detection
- Many of the components described above require language detection
  - For docs/paragraphs at indexing time
  - For query terms at query time (much harder)
- For docs/paragraphs, we generally have enough text to apply machine learning methods
- For queries, we lack sufficient text
  - Augment with other cues, such as client properties/specification from the application
  - Domain of query origination, etc.
-
What queries can we process?
- We have:
  - Basic inverted index with skip pointers
  - Wild-card index
  - Spell-correction
  - Soundex
- Queries such as:
  (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)
-
Aside: results caching
- If 25% of your users are searching for britney AND spears, then you probably do need spelling correction, but you don't need to keep on intersecting those two postings lists
- Web query distribution is extremely skewed, and you can usefully cache results for common queries
- Query log analysis
-
B-Trees
-
Motivation for B-Trees
- Index structures for large datasets cannot be stored in main memory
- Storing them on disk requires a different approach to efficiency
- Assuming a disk spins at 3600 RPM, one revolution occurs in 1/60 of a second, or 16.7 ms
- Crudely speaking, one disk access takes about the same time as 200,000 instructions
-
Motivation (cont.)
- Assume we use an AVL tree to store about 20 million records
- We end up with a very deep binary tree with lots of different disk accesses; log2 20,000,000 is about 24, so this takes about 0.2 seconds
- We know we can't improve on the log n lower bound on search for a binary tree
- But the solution is to use more branches and thus reduce the height of the tree!
  - As branching increases, depth decreases
-
Definition of a B-tree
- A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which:
  1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree
  2. all leaves are on the same level
  3. all non-leaf nodes except the root have at least ceil(m/2) children
  4. the root is either a leaf node, or it has from two to m children
-
An example B-Tree
[Figure: a B-tree of order 5 containing 26 items, with keys 1, 2, 4, 6, 7, 8, 12, 13, 15, 18, 25, 26, 27, 29, 42, 45, 46, 48, 51, 53, 55, 60, 62, 64, 70, 90; the root holds 42, 51, 62]
- Note that all the leaves are at the same level
-
Constructing a B-tree
- Suppose we start with an empty B-tree and keys arrive in the following order: 1 12 8 2 25 5 14 28 17 7 52 16 48 68 3 26 29 53 55 45
- We want to construct a B-tree of order 5
- The first four items go into the root: [1 2 8 12]
- Putting a fifth item in the root would violate the order-5 limit of four keys per node
- Therefore, when 25 arrives, pick the middle key to make a new root
-
Constructing a B-tree (cont'd.)
[Figure: 25 arrives; the middle key 8 is promoted to a new root, with children [1 2] and [12 25]]
- 6, 14, 28 get added to the leaf nodes:
[Figure: root 8, with leaves [1 2 6] and [12 14 25 28]]
-
Constructing a B-tree (cont'd.)
- Adding 17 to the right leaf node would over-fill it, so we take the middle key, promote it (to the root) and split the leaf:
[Figure: root 8 17, with leaves [1 2 6], [12 14], [25 28]]
- 7, 52, 16, 48 get added to the leaf nodes:
[Figure: root 8 17, with leaves [1 2 6 7], [12 14 16], [25 28 48 52]]
-
Constructing a B-tree (cont'd.)
- Adding 68 causes us to split the rightmost leaf, promoting 48 to the root; adding 3 causes us to split the leftmost leaf, promoting 3 to the root; 26, 29, 53, 55 then go into the leaves:
[Figure: root 3 8 17 48, with leaves [1 2], [6 7], [12 14 16], [25 26 28 29], [52 53 55 68]]
- Adding 45 causes a split of [25 26 28 29], and promoting 28 to the root then causes the root to split
-
Constructing a B-tree (cont'd.)
[Figure: the resulting tree. Root: 17; children [3 8] and [28 48]; leaves [1 2], [6 7], [12 14 16], [25 26], [29 45], [52 53 55 68]]
-
Inserting into a B-Tree
- Attempt to insert the new key into a leaf
- If this would result in that leaf becoming too big, split the leaf into two, promoting the middle key to the leaf's parent
- If this would result in the parent becoming too big, split the parent into two, promoting the middle key
- This strategy might have to be repeated all the way to the top
- If necessary, the root is split in two and the middle key is promoted to a new root, making the tree one level higher
-
Exercise in inserting into a B-Tree
- Insert the following keys into a 5-way B-tree:
  3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
-
Removal from a B-tree
- During insertion, the key always goes into a leaf. For deletion we wish to remove from a leaf. There are three possible ways we can do this:
- 1: If the key is already in a leaf node, and removing it doesn't cause that leaf node to have too few keys, then simply remove the key to be deleted.
- 2: If the key is not in a leaf, then it is guaranteed (by the nature of a B-tree) that its predecessor or successor will be in a leaf; in this case we can delete the key and promote the predecessor or successor key to the non-leaf deleted key's position.
-
Removal from a B-tree (2)
- If (1) or (2) lead to a leaf node containing less than the minimum number of keys, then we have to look at the siblings immediately adjacent to the leaf in question:
  - 3: if one of them has more than the minimum number of keys, then we can promote one of its keys to the parent and take the parent key into our lacking leaf
  - 4: if neither of them has more than the minimum number of keys, then the lacking leaf and one of its neighbours can be combined with their shared parent (the opposite of promoting a key), and the new leaf will have the correct number of keys; if this step leaves the parent with too few keys, the process is repeated up the tree
-
Type #1: Simple leaf deletion
[Figure: a 5-way B-tree, as before; root 12 29 52, leaves [2 7 9], [15 22], [31 43], [56 69 72]]
- Delete 2: since there are enough keys in the node, just delete it
(Note when printed: this slide is animated)
-
Type #2: Simple non-leaf deletion
[Figure: root 12 29 52, leaves [7 9], [15 22], [31 43], [56 69 72]]
- Delete 52: borrow the predecessor or (in this case) the successor, 56
-
Type #4: Too few keys in node and its siblings
[Figure: root 12 29 56, leaves [7 9], [15 22], [31 43], [69 72]]
- Delete 72: too few keys! Join the node back together with its sibling
-
Type #4: Too few keys in node and its siblings (result)
[Figure: root 12 29, leaves [7 9], [15 22], [31 43 56 69]]
-
Type #3: Enough siblings
[Figure: root 12 29, leaves [7 9], [15 22], [31 43 56 69]]
- Delete 22: demote a root key and promote a leaf key
-
Type #3: Enough siblings (result)
[Figure: root 12 31, leaves [7 9], [15 29], [43 56 69]]
-
Exercise in removal from a B-Tree
- Given the 5-way B-tree created from these data (last exercise):
  3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
- Add these further keys: 2, 6, 12
- Delete these keys: 4, 5, 7, 3, 14
-
Analysis of B-Trees
- The maximum number of items in a B-tree of order m and height h:
  root:    m - 1
  level 1: m(m - 1)
  level 2: m^2(m - 1)
  ...
  level h: m^h(m - 1)
- So the total number of items is
  (1 + m + m^2 + m^3 + ... + m^h)(m - 1) = [(m^(h+1) - 1) / (m - 1)](m - 1) = m^(h+1) - 1
- When m = 5 and h = 2 this gives 5^3 - 1 = 124
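The geometric-series result above is easy to check numerically (a small sketch; the function name is mine):

```python
def btree_max_items(m, h):
    """Maximum number of items in a B-tree of order m and height h: m^(h+1) - 1."""
    return m ** (h + 1) - 1

print(btree_max_items(5, 2))    # 124, matching the slide
print(btree_max_items(101, 3))  # 101**4 - 1, roughly 100 million
```

The second call previews the next slide's point: a wide, shallow tree holds an enormous number of items in very few levels.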
-
Reasons for using B-Trees
- When searching tables held on disc, the cost of each disc transfer is high but doesn't depend much on the amount of data transferred, especially if consecutive items are transferred
- If we use a B-tree of order 101, say, we can transfer each node in one disc-read operation
- A B-tree of order 101 and height 3 can hold 101^4 - 1 items (approximately 100 million), and any item can be accessed with 3 disc reads (assuming we hold the root in memory)
- If we take m = 3, we get a 2-3 tree, in which non-leaf nodes have two or three children (i.e., one or two keys)
- B-Trees are always balanced, since the leaves are all at the same level
-
Comparing Trees
- Binary trees
  - Can become unbalanced and lose their good time complexity (big O)
  - AVL trees are strict binary trees that overcome the balance problem
  - Heaps remain balanced, but only prioritise (not order) the keys
- Multi-way trees
  - B-Trees can be m-way; they can have any (odd) number of children
  - One B-Tree: the 2-3 (or 3-way) B-Tree