Introduction to Information Retrieval
Lecture 2: The term vocabulary and postings lists
Related to Chapter 2: http://nlp.stanford.edu/IR-book/pdf/02voc.pdf
Recap of the previous lecture
Basic inverted indexes:
Structure: Dictionary and Postings
Boolean query processing:
Intersection by linear-time “merging”
Simple optimizations
Ch. 1
Plan for this lecture
Preprocessing to form the term vocabulary:
Tokenization
Normalization
Postings:
Faster merges: skip lists
Positional postings and phrase queries
Recall the basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index:
friend → 2, 4
roman → 1, 2
countryman → 13, 16
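As an illustration (not the book's code), here is a minimal Python sketch of this pipeline; the docIDs, the crude tokenizer, and the lower-casing “linguistic module” are all illustrative assumptions:

    from collections import defaultdict

    def build_index(docs):
        """Toy pipeline: tokenize, apply a crude linguistic module, then index."""
        index = defaultdict(set)                  # term -> set of docIDs (postings)
        for doc_id, text in docs.items():
            tokens = text.replace(",", " ").replace(".", " ").split()  # crude tokenizer
            terms = [t.lower() for t in tokens]   # crude normalization (no stemming yet)
            for term in terms:
                index[term].add(doc_id)
        return {term: sorted(postings) for term, postings in index.items()}

    docs = {2: "Friends, Romans, countrymen.", 4: "Romans and friends."}
    print(build_index(docs))
    # {'friends': [2, 4], 'romans': [2, 4], 'countrymen': [2], 'and': [4]}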
The first step: Parsing a document
What format is it in? (pdf / word / excel / html)
What language is it in?
What character set is in use?
Sec. 2.1
Complications: Format/language
Documents being indexed can include docs from many different languages: a single index may have to contain terms of several languages.
Sometimes a document or its components can contain multiple languages/formats, e.g., a French email with a German pdf attachment.
Sec. 2.1
TOKENIZATION
Tokenization
Given a character sequence, tokenization is the task
of chopping it up into pieces, called tokens.
Perhaps at the same time throwing away certain
characters, such as punctuation.
Tokenization
Input: “university of Qom, computer department”
Output tokens: university | of | Qom | computer | department
Each such token is now a candidate for an index entry, after further processing: Normalization.
But what are valid tokens to emit?
Sec. 2.2.1
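As a minimal sketch (the regular expression below is an assumption of these notes, not the book's tokenizer):

    import re

    def tokenize(text):
        """Split on any run of characters that is not a word character or apostrophe."""
        return [tok for tok in re.split(r"[^\w']+", text) if tok]

    print(tokenize("university of Qom, computer department"))
    # ['university', 'of', 'Qom', 'computer', 'department']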
Issues in tokenization
Apostrophes: Iran’s capital → Iran? Irans? Iran’s?
Hyphens: Hewlett-Packard → Hewlett and Packard as two tokens? the hold-him-back-and-drag-him-away maneuver; co-author; lowercase, lower-case, lower case?
Spaces: San Francisco: how do you decide it is one token?
Sec. 2.2.1
Issues in tokenization: Numbers
Examples: 3/12/91; Mar. 12, 1991; 12/3/91; 55 B.C.; B-52; My PGP key is 324a3df234cb23e; (800) 234-2333
Older IR systems may not index numbers.
But numbers are often very useful, e.g., looking up error codes/stack traces on the web.
Sec. 2.2.1
Language issues in tokenization
German noun compounds are not segmented:
Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’
German retrieval systems benefit greatly from a compound splitter module
Can give a 15% performance boost for German
Sec. 2.2.1
Language issues in tokenization
Chinese and Japanese have no spaces between words: 莎拉波娃现在居住在美国东南部的佛罗里达。
Further complicated in Japanese, with multiple alphabets intermingled
フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )
Katakana Hiragana Kanji Romaji
Sec. 2.2.1
Stop words
With a stop list, you exclude from the dictionary words like the, a, and, to, be.
Intuition: they have little semantic content and are the commonest words.
Using a stop list significantly reduces the number of postings that a system has to store, because stop words occur very often.
Two ways to construct a stop list: by expert, or by machine.
Sec. 2.2.2
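A minimal sketch of stop-word removal; the tiny stop list below is only illustrative, not a recommended list:

    STOP_WORDS = {"the", "a", "and", "to", "be", "of", "in", "that", "it"}

    def remove_stop_words(tokens):
        """Drop tokens that appear in the stop list (after case folding)."""
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))   # ['or', 'not']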
Stop words
But you need them for:
Phrase queries: “President of Iran”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to London”
The general trend in IR systems: from large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list.
Good compression techniques (lecture 5) mean the space for including stop words in a system is very small.
Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words
NORMALIZATION
Normalization to terms
Example: we want to match I.R. and IR.
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
Result is terms: A term is a (normalized) word type, which is an entry in
our IR system dictionary
Sec. 2.2.3
Normalization example: Case folding
Reduce all letters to lower case.
Exception: upper case in mid-sentence?
Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization.
Sec. 2.2.3
Normalization to terms
One way is using equivalence classes:
car = automobile, color = colour
A search for one of these terms will retrieve documents that contain any member of its equivalence class.
We most commonly define equivalence classes of terms implicitly, via rules applied as terms are formed, rather than calculating them fully in advance (hand constructed), e.g.:
deleting periods to form a term: I.R., IR
deleting hyphens to form a term: anti-discriminatory, antidiscriminatory
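A sketch of such implicitly defined equivalence classes, assuming three simple rewrite rules (case folding, deleting periods, deleting hyphens); the rules are illustrative, not exhaustive:

    def normalize(token):
        """Map a token to its term: case-fold, then drop periods and hyphens."""
        term = token.lower()
        term = term.replace(".", "")   # I.R. -> ir
        term = term.replace("-", "")   # anti-discriminatory -> antidiscriminatory
        return term

    for tok in ["I.R.", "IR", "anti-discriminatory", "antidiscriminatory"]:
        print(tok, "->", normalize(tok))
    # all four tokens map onto just two terms: 'ir' and 'antidiscriminatory'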
Normalization: other languages
Accents: e.g., French résumé vs. resume.
Umlauts: e.g., German Tuebingen vs. Tübingen.
Normalization of things like date forms:
7 月 30 日 vs. 7/30
Tokenization and normalization may depend on the language and so are intertwined with language detection.
Crucial: Need to “normalize” indexed text as well as query terms into the same form
Sec. 2.2.3
Normalization to terms
What is the disadvantage of equivalence classing?
An alternative to equivalence classing is asymmetric expansion, which is hand constructed. An example:
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows, window
Enter: Windows → Search: Windows
Potentially more powerful, but less efficient.
Sec. 2.2.3
Stemming and lemmatization
Documents are going to use different forms of a word: organize, organizes, and organizing; am, are, and is.
There are families of derivationally related words with similar meanings: democracy, democratic, and democratization.
Reduce tokens to their “roots” before indexing.
Stemming
“Stemming” suggests crude affix chopping; it is language dependent.
Example: Porter’s algorithm
http://www.tartarus.org/~martin/PorterStemmer Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Sec. 2.2.4
Typical rules in Porter
sses → ss: presses → press
ies → i: bodies → bodi
ss → ss: press → press
s → (dropped): cats → cat
Sec. 2.2.4
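A sketch of just these step-1a-style rules (a tiny fragment, not the full Porter algorithm; for real use, the implementations linked on the previous slide are the natural choice):

    def porter_step_1a(word):
        """Apply the longest matching suffix rule: sses->ss, ies->i, ss->ss, s->(drop)."""
        if word.endswith("sses"):
            return word[:-4] + "ss"    # presses -> press
        if word.endswith("ies"):
            return word[:-3] + "i"     # bodies -> bodi
        if word.endswith("ss"):
            return word                # press -> press
        if word.endswith("s"):
            return word[:-1]           # cats -> cat
        return word

    for w in ["presses", "bodies", "press", "cats"]:
        print(w, "->", porter_step_1a(w))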
Lemmatization
Reduce inflectional/variant forms properly to the base form (lemma), with the use of a vocabulary and morphological analysis of words.
Lemmatizer: a tool from Natural Language Processing which does full morphological analysis to accurately identify the lemma for each word.
Sec. 2.2.4
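A minimal lookup-based sketch; real lemmatizers use a full vocabulary and morphological analysis, and the tiny table here is only illustrative:

    LEMMA_TABLE = {"am": "be", "are": "be", "is": "be",
                   "organizes": "organize", "organizing": "organize"}

    def lemmatize(token):
        """Return the base form (lemma) if known, otherwise the token itself."""
        return LEMMA_TABLE.get(token.lower(), token.lower())

    print([lemmatize(t) for t in ["am", "are", "is", "organizing"]])
    # ['be', 'be', 'be', 'organize']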
Language-specificity
Many of the above features embody transformations that are language-specific and often application-specific.
These are “plug-in” addenda to the indexing process.
Both open source and commercial plug-ins are available for handling these
Sec. 2.2.4
Helpfulness of normalization
Do stemming and other normalizations help?
Definitely useful for Spanish, German, Finnish, …: 30% performance gains for Finnish!
What about English? Not such considerable help: it helps a lot for some queries, but hurts performance a lot for others.
Helpfulness of normalization
Example:
operative (dentistry) ⇒ oper
operational (research) ⇒ oper
operating (systems) ⇒ oper
For a case like this, moving to using a lemmatizer would not completely fix the problem
FASTER POSTINGS MERGES: SKIP LISTS
Recall basic merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128
Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
Intersection: 2 → 8
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes!
Sec. 2.3
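A Python sketch of this linear-time merge (docIDs are assumed to be sorted, as in the book):

    def intersect(p1, p2):
        """Intersect two sorted postings lists in O(m + n) time."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 41, 48, 64, 128]
    caesar = [1, 2, 3, 8, 11, 17, 21, 31]
    print(intersect(brutus, caesar))   # [2, 8]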
Augment postings with skip pointers
At indexing time.
The resulting list is a skip list.
List 1: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers from 2 to 41 and from 41 to 128.
List 2: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers from 1 to 11 and from 11 to 31.
Sec. 2.3
Query processing with skip pointers
(The same two postings lists, with the same skip pointers, as on the previous slide.)
Suppose we’ve stepped through the lists until we process 8 on each list.
We then have 41 and 11. 11 is smaller.
But the skip successor of 11 is 31, so we can skip ahead past the intervening postings.
Sec. 2.3
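A sketch of intersection with skip pointers. Representing each list as a Python list plus a dict mapping a position to the position it can skip to is an assumption of these notes (the book draws them as linked postings with skip arcs):

    def intersect_with_skips(p1, skips1, p2, skips2):
        """Intersect sorted postings lists; follow a skip only when it does not overshoot."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                if i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]      # skip ahead; the loop re-compares afterwards
                else:
                    i += 1
            else:
                if j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
                else:
                    j += 1
        return answer

    # Lists and skip pointers from the example above (2->41, 41->128 and 1->11, 11->31).
    p1, skips1 = [2, 4, 8, 41, 48, 64, 128], {0: 3, 3: 6}
    p2, skips2 = [1, 2, 3, 8, 11, 17, 21, 31], {0: 4, 4: 7}
    print(intersect_with_skips(p1, skips1, p2, skips2))   # [2, 8]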
Where do we place skips?
Tradeoff:
More skips → more likely to skip, but lots of comparisons to skip pointers.
Fewer skips → few pointer comparisons, but few successful skips.
Sec. 2.3
Placing skips
Simple heuristic: for a postings list of length L, use √L evenly spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps changing because of updates.
The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
Sec. 2.3
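A sketch of the √L placement at indexing time, using the same dict-of-positions representation as the intersection sketch above:

    import math

    def build_skips(postings):
        """Place roughly sqrt(L) evenly spaced skip pointers on a list of length L."""
        L = len(postings)
        step = int(math.sqrt(L)) or 1
        return {i: i + step for i in range(0, L - step, step)}

    print(build_skips([1, 2, 3, 8, 11, 17, 21, 31]))   # {0: 2, 2: 4, 4: 6}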
D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.
PHRASE QUERIES AND POSITIONAL INDEXES
Phrase queries
We want to be able to answer queries such as “stanford university” – as a phrase.
Thus the sentence “I went to university at Stanford” is not a match.
Most recent search engines support a double quotes syntax
Sec. 2.4
Phrase queries
The phrase query has proven to be very easily understood and successfully used by users.
As many as 10% of web queries are phrase queries.
For this, it no longer suffices to store only <term : docs> entries
Solutions?
A first attempt: Biword indexes
Index every consecutive pair of terms in the text as a phrase.
For example, the text “Qom computer department” would generate the biwords: “Qom computer”, “computer department”.
Two-word phrase query-processing is now immediate.
Sec. 2.4.1
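A sketch of building a biword index and answering a two-word phrase query with a single lookup (the helper names are illustrative):

    from collections import defaultdict

    def build_biword_index(docs):
        """Index every consecutive pair of tokens as one dictionary entry."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            tokens = text.lower().split()
            for first, second in zip(tokens, tokens[1:]):
                index[first + " " + second].add(doc_id)
        return index

    docs = {1: "Qom computer department", 2: "computer department of Qom"}
    biwords = build_biword_index(docs)
    print(sorted(biwords["computer department"]))   # [1, 2]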
Longer phrase queries
The query “modern information retrieval course” can be broken into a Boolean query on biwords:
“modern information” AND “information retrieval” AND “retrieval course”
This works fairly well in practice.
But there can and will be occasional errors.
Issues for biword indexes
Errors, as noted before.
Index blowup due to a bigger dictionary: infeasible for more than biwords, and big even for them.
Biword indexes are not the standard solution but can be part of a compound strategy.
Sec. 2.4.1
Solution 2: Positional indexes
In the postings, store for each term the position(s) in which tokens of it appear:
<term, number of docs containing term;
 doc1: position1, position2 …;
 doc2: position1, position2 …;
 etc.>
Sec. 2.4.2
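A sketch of building such a positional index; the nested-dictionary representation (term → {docID: [positions]}) is an assumption of these notes, and the document-frequency field of the book's entry is simply the number of docIDs stored for the term:

    from collections import defaultdict

    def build_positional_index(docs):
        """term -> {docID: sorted list of positions where the term occurs}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, token in enumerate(text.lower().split()):
                index[token][doc_id].append(pos)
        return index

    docs = {1: "to be or not to be", 2: "be quick"}
    idx = build_positional_index(docs)
    print(dict(idx["be"]))   # {1: [1, 5], 2: [0]}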
Positional index example
For phrase queries, we need to deal with more than just equality.
<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>
Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
Sec. 2.4.2
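A simplified sketch of answering a phrase query with a positional index: intersect the document lists term by term, then check inside each shared document that the positions line up consecutively (this is not the book's exact pseudocode):

    def phrase_query(index, phrase):
        """Return docIDs where the words of `phrase` occur at consecutive positions."""
        terms = phrase.lower().split()
        postings = [index.get(t, {}) for t in terms]      # each: {docID: [positions]}
        shared = set(postings[0]).intersection(*postings[1:])
        hits = []
        for doc_id in sorted(shared):
            for start in postings[0][doc_id]:
                if all(start + k in postings[k][doc_id] for k in range(1, len(terms))):
                    hits.append(doc_id)
                    break
        return hits

    # Tiny positional index, term -> {docID: [positions]}:
    idx = {"to": {1: [0, 4], 3: [2]},
           "be": {1: [1, 5], 2: [0], 3: [0]}}
    print(phrase_query(idx, "to be"))   # [1]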
Proximity queries
Example: LIMIT /3 STATUTE /3 FEDERAL /2 TORT
/k means “within k words of (on either side)”.
Clearly, positional indexes can be used for such queries; biword indexes cannot.
Sec. 2.4.2
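A brute-force sketch of a /k check for two terms inside one document, given their position lists; the book gives a more efficient merge-style algorithm, so this double loop is only the simplest correct version:

    def within_k(positions1, positions2, k):
        """True if some occurrence of term 1 is within k word positions of term 2."""
        return any(abs(p1 - p2) <= k for p1 in positions1 for p2 in positions2)

    # e.g. does LIMIT occur within 3 words of STATUTE in this document?
    print(within_k([7, 42], [4, 90], k=3))   # True: |7 - 4| = 3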
Positional index size
A positional index needs an entry for each occurrence of a term, not just one per document.
Consider a term with frequency 0.1%:
Document size: 1,000 words → Postings: 1; Positional postings: 1
Document size: 100,000 words → Postings: 1; Positional postings: 100
Sec. 2.4.2
Rules of thumb
A positional index is 2–4 times as large as a non-positional index.
Positional index size is 35–50% of the volume of the original text.
Caveat: all of this holds for “English-like” languages
Sec. 2.4.2
Combination schemes
These two approaches (biword indexes and positional indexes) can be profitably combined.
For particular phrases (“Hossein Rezazadeh”) it is inefficient to keep on merging positional postings lists.
Sec. 2.4.3
Combination schemes
Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme:
A typical web query mixture was executed in ¼ of the time of using just a positional index.
It required 26% more space than having a positional index alone.
H.E. Williams, J. Zobel, and D. Bahle. 2004. “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.
Exercise
Write pseudo-code for biword phrase queries using a positional index.
Do exercises 2.5 and 2.6 of your book.