Information Retrieval Lecture 2: The term vocabulary and postings lists.


Plan for this lecture

Elaborate basic indexing
Preprocessing to form the term vocabulary
  Documents
  Tokenization
  What terms do we put in the index?
Postings
  Faster merges: skip lists
  Positional postings and phrase queries

Recall basic indexing pipeline

Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends Romans Countrymen
  ↓ Linguistic modules
Modified tokens: friend roman countryman
  ↓ Indexer
Inverted index: friend, roman, countryman (each term points to its postings list)
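
The three stages of the pipeline can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tokenizer keeps runs of ASCII letters, and the "linguistic modules" are replaced by a hypothetical lookup table (`TOY_LEMMAS`) that covers just the slide's example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Tokenizer: keep runs of ASCII letters (a deliberately naive choice)."""
    return re.findall(r"[A-Za-z]+", text)

# Hypothetical stand-in for the linguistic modules; covers only this example.
TOY_LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def normalize(token):
    """Linguistic modules: lower-case, then map to a base form if we know one."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)

def build_index(docs):
    """Indexer: map each term to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Friends, Romans, countrymen."}
index = build_index(docs)
print(index)
```

Each stage is a separate function so that any one of them (for instance the normalizer) can be swapped out without touching the others.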

Parsing a document

What format is it in? pdf/word/excel/html?

What language is it in? What character set is in use?

Each of these is a classification problem, which we will study later in the course.

But these tasks are often done heuristically …

Complications: Format/language

Documents being indexed can include docs from many different languages
  A single index may have to contain terms of several languages.
Sometimes a document or its components can contain multiple languages/formats
  French email with a German pdf attachment.

Tokens and Terms

Tokenization

Input: “Friends, Romans and Countrymen”

Output: Tokens Friends Romans Countrymen

Each such token is now a candidate for an index entry, after further processing (described below)

But what are valid tokens to emit?

Why tokenization is difficult – even in English

Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.

Tokenize this sentence
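
To see why the choice matters, the sketch below runs two naive tokenizers over the example sentence. Both function names are illustrative, and neither strategy is meant as a recommendation; the point is that each one makes a different mistake.

```python
import re

sentence = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")

def split_on_whitespace(text):
    """Strategy 1: split on whitespace only; punctuation stays glued to words."""
    return text.split()

def split_on_non_letters(text):
    """Strategy 2: keep runs of ASCII letters; apostrophes break words apart."""
    return re.findall(r"[A-Za-z]+", text)

ws_tokens = split_on_whitespace(sentence)
letter_tokens = split_on_non_letters(sentence)

print(ws_tokens)      # keeps "O'Neill" whole but also emits "amusing."
print(letter_tokens)  # emits "O", "Neill", "aren", "t" as separate tokens
```

Whichever strategy is chosen, the same one has to be applied to queries, otherwise a query for O'Neill will not match the indexed tokens.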

One word or two? (or several)

fault-finder
co-education
state-of-the-art
data base
San Francisco
cheap San Francisco-Los Angeles fares

Numbers

3/12/91
Mar. 12, 1991
55 B.C.
B-52
My PGP key is 324a3df234cb23e
(800) 234-2333

Often have embedded spaces
Often, don’t index as text

But often very useful: think about things like looking up error codes on the web

(One answer is using n-grams: Lecture 3)

Tokenization: language issues

Chinese and Japanese have no spaces between words:
  莎拉波娃现在居住在美国东南部的佛罗里达。
  Not always guaranteed a unique tokenization

Ambiguous segmentation in Chinese

The two characters 和尚 can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.
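
The ambiguity can be made concrete by enumerating every segmentation that a lexicon allows. The sketch below assumes the two characters in question are 和尚 and uses a three-entry toy lexicon built from the glosses above; a real segmenter would need a full dictionary and some way to rank the candidates.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words that all appear in lexicon."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

# Toy lexicon (an assumption for illustration), with English glosses.
LEXICON = {"和尚": "monk", "和": "and", "尚": "still"}

options = segmentations("和尚", LEXICON)
for words in options:
    print(words, [LEXICON[w] for w in words])
```

Both segmentations are valid according to the lexicon, so the tokenizer cannot pick one from the characters alone.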

Normalization

Need to “normalize” terms in indexed text as well as query terms into the same form.

  Example: We want to match U.S.A. and USA
We most commonly implicitly define equivalence classes of terms.
Alternatively: do asymmetric expansion
  window → window, windows
  windows → Windows, windows
  Windows (no expansion)

More powerful, but less efficient
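
The two approaches can be contrasted in a short sketch. The equivalence-class function assumes a single normalization rule (delete periods), and the expansion table simply copies the three cases listed above; both names are illustrative.

```python
def equivalence_class(term):
    """Equivalence classing: map a term to a canonical form by deleting periods."""
    return term.replace(".", "")

# Asymmetric expansion: what to search for depends on what the user typed.
QUERY_EXPANSION = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],  # no expansion
}

def expand_query_term(term):
    """Return the list of index terms to look up for one query term."""
    return QUERY_EXPANSION.get(term, [term])

print(equivalence_class("U.S.A."), equivalence_class("USA"))
print(expand_query_term("window"))
print(expand_query_term("Windows"))
```

Equivalence classing costs one lookup per query term, while expansion may cost several, which is the efficiency trade-off noted above.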

Case folding

Reduce all letters to lower case
  exception: upper case in mid-sentence?
    Fed vs. fed
Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…

Lemmatization

Reduce inflectional/variant forms to base form

E.g.,
  am, are, is → be
  car, cars, car's, cars' → car

the boy's cars are different colors → the boy car be different color

Lemmatization implies doing “proper” reduction to dictionary headword form
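
As a rough illustration, the examples above can be reproduced with a lookup table. The table (`LEMMA_TABLE`) is hand-built for this one sentence and is purely an assumption; a real lemmatizer relies on a full dictionary and morphological analysis.

```python
# Hand-built toy table covering only the slide's examples (an assumption).
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(sentence):
    """Replace each whitespace-separated word by its headword, if known."""
    return " ".join(LEMMA_TABLE.get(word, word) for word in sentence.split())

result = lemmatize("the boy's cars are different colors")
print(result)
```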

Stemming

Definition of stemming: Crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge

Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping
  language dependent
  e.g., automate(s), automatic, automation all reduced to automat.

Example text: for example compressed and compression are both accepted as equivalent to compress.

Stemmed: for exampl compress and compress ar both accept as equival to compress

Porter algorithm

Most common algorithm for stemming English
Results suggest that it is at least as good as other stemming options
Phases are applied sequentially
Each phase consists of a set of commands.

Sample command: Delete final ement if what remains is longer than 1 character
  replacement → replac
  cement → cement

Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
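
The sample command can be written out directly. This sketch follows the simplified wording on the slide (remaining length in characters); the full Porter algorithm states the condition in terms of a "measure" of the stem rather than a raw character count.

```python
def delete_final_ement(word):
    """Delete a final 'ement' if what remains is longer than 1 character."""
    suffix = "ement"
    if word.endswith(suffix):
        remainder = word[: -len(suffix)]
        if len(remainder) > 1:
            return remainder
    return word

print(delete_final_ement("replacement"))  # remainder "replac" is long enough
print(delete_final_ement("cement"))       # remainder "c" is too short, so unchanged
```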

Porter stemmer: A few rules

Rule          Example
SSES → SS     caresses → caress
IES → I       ponies → poni
SS → SS       caress → caress
S →           cats → cat
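
These four rules, combined with the longest-suffix convention, fit in a few lines. The sketch covers only this one compound command, not the rest of the Porter stemmer, and the function name is illustrative.

```python
# (suffix, replacement) pairs for the four rules in the table.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_plural_rules(word):
    """Apply the single rule whose suffix is the longest one matching the word."""
    matches = [(suffix, repl) for suffix, repl in RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[: -len(suffix)] + repl

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_plural_rules(w))
```

The SS → SS rule looks like a no-op, but it is what stops the shorter S rule from turning caress into "cares".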

Other stemmers

Other stemmers exist, e.g., Lovins stemmer
  http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

Single-pass, longest suffix removal (about 250 rules)

Full morphological analysis – at most modest benefits for retrieval

Do stemming and other normalizations help?

English: very mixed results. Helps recall for some queries but harms precision on others
  E.g., Porter Stemmer equivalence class oper contains all of operate operating operates operation operative operatives operational

Definitely useful for Spanish, German, Finnish, …

Thesauri

Handle synonyms and homonyms
  Hand-constructed equivalence classes
    e.g., car = automobile
    color = colour
Rewrite to form equivalence classes
Index such equivalences
  When the document contains automobile, index it under car as well (usually, also vice-versa)
Or expand query?
  When the query contains automobile, look under car as well
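
The two options, expanding at index time or at query time, can be compared in a small sketch. The equivalence classes are the two from the slide; the documents and all function names are made up for illustration.

```python
from collections import defaultdict

# Hand-constructed equivalence classes from the slide.
CLASSES = [{"car", "automobile"}, {"color", "colour"}]

def equivalents(term):
    """Return the equivalence class of a term, or just the term itself."""
    for cls in CLASSES:
        if term in cls:
            return cls
    return {term}

def build_index(docs, expand_at_index_time):
    """Index each term; optionally also index it under all its equivalents."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            targets = equivalents(term) if expand_at_index_time else {term}
            for target in targets:
                index[target].add(doc_id)
    return index

def search(index, term, expand_at_query_time):
    """Look up a term; optionally also look under all its equivalents."""
    targets = equivalents(term) if expand_at_query_time else {term}
    hits = set()
    for target in targets:
        hits |= index.get(target, set())
    return sorted(hits)

docs = {1: ["automobile", "colour"], 2: ["car"]}  # hypothetical documents
expanded_index = build_index(docs, expand_at_index_time=True)
plain_index = build_index(docs, expand_at_index_time=False)

print(search(expanded_index, "car", expand_at_query_time=False))
print(search(plain_index, "car", expand_at_query_time=True))
print(search(plain_index, "car", expand_at_query_time=False))
```

Index-time expansion makes the index larger, while query-time expansion makes each query do more lookups; without either, the document that says automobile is missed.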

Stop words

stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
  They have little semantic content
  Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Stop word elimination used to be standard in older IR systems.

But the trend is away from doing this:
  Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  Good query optimization techniques mean you pay little at query time for including stop words.
  You need them for:
    Phrase queries: “King of Denmark”
    Various song titles, etc.: “Let it be”, “To be or not to be”
    “Relational” queries: “flights to London”

Most web search engines index stop words
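
A quick experiment shows what is lost when the stop list above is applied to these queries. The sketch lower-cases and splits on whitespace, which is a simplification, and the function name is illustrative.

```python
# The example stop list from the slide.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "that", "the", "to", "was",
    "were", "will", "with",
}

def remove_stop_words(query):
    """Lower-case the query, split on whitespace, and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

for q in ["King of Denmark", "Let it be", "To be or not to be", "flights to London"]:
    print(q, "->", remove_stop_words(q))
```

After filtering, "flights to London" can no longer be told apart from "flights from London", and "To be or not to be" keeps only two of its six words.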
