
Page 1: Information Retrieval in Text Part II

• Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999.

• Reading Assignment: Chapter 3.

Page 2: Traditional Information Retrieval vs. Vector Space Models

• Traditional lexical (or Boolean) IR techniques, e.g. keyword matching, often return too much and/or irrelevant information to the user due to
  • Synonymy (multiple ways to express a concept), and
  • Polysemy (multiple meanings of a word).
  – If the terms used in the query differ from the terms used in the document, valuable information can never be found.

• Vector space models can be used to encode/represent terms and documents in a text collection
  – Each component of a document vector can be used to represent a particular term/keyword/phrase/concept
  – The value assigned reflects the semantic importance of the term/keyword/phrase/concept

Page 3: Vector Space Construction

• A document collection of n documents indexed using m terms is represented as an m × n term-by-document matrix A.
  – a_ij is the weighted frequency at which term i occurs in document j.
  – Columns are document vectors and rows are term vectors.
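As an illustration of this construction, here is a minimal sketch that builds a term-by-document matrix for a made-up three-document collection (the documents and terms are invented for illustration and are not the book's example):

```python
import numpy as np

# A made-up toy collection: three short "documents".
docs = [
    "baby safety at home",
    "home safety guide",
    "guide to baby care",
]

# Rows are terms, columns are documents; a_ij is the raw frequency
# of term i in document j.
terms = sorted({word for d in docs for word in d.split()})
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for word in d.split():
        A[terms.index(word), j] += 1

print(terms)  # row labels (terms)
print(A)      # columns are document vectors, rows are term vectors
```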

Page 4: Vector Space Properties

• The document vectors span the content, yet it is not the case that every vector represented in the column space of A has a specific interpretation.

– For example, a linear combination of any two document vectors does not necessarily represent a viable document of the collection

• However, the vector space model can exploit geometric relationships between document (and term) vectors in order to explain both similarities and differences in concepts.

Page 5: Term-by-Document Matrices

• What is the relationship between the number of terms, m, and the number of documents, n, in
  – Heterogeneous text collections such as newspapers and encyclopedias?
  – The WWW?
    • Roughly 300,000 × 300,000,000 (as of the late 90's)

Page 6: Term-by-Document Matrices: Example

Page 7: Term-by-Document Matrices: Example

Page 8: Term-by-Document Matrices: Example

• Although the frequency of each term here is 1, it can be higher.
  – Matrix entries can be scaled so that the Euclidean norm of each document vector is equal to 1:

    $\|x\|_2 = (x^T x)^{1/2} = \left( \sum_{i=1}^{m} x_i^2 \right)^{1/2}$

• Determining which words to index and which words to discard defines both the art and the science of automated indexing.

• Terms are usually identified by their word stems.
  – Stemming reduces the number of rows in the term-by-document matrix, which may lead to storage savings.
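A minimal sketch of that scaling step, applied column by column to an illustrative matrix (the entries are made up):

```python
import numpy as np

def normalize_columns(A):
    # ||x||_2 = (x^T x)^(1/2) = (sum_i x_i^2)^(1/2), computed per column
    norms = np.linalg.norm(A, axis=0)
    norms = np.where(norms == 0, 1.0, norms)   # guard against all-zero documents
    return A / norms

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])                # illustrative raw-frequency matrix
A_hat = normalize_columns(A)
print(np.linalg.norm(A_hat, axis=0))           # every document vector now has norm 1
```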

Page 9: Simple Query Matching

• Queries are represented as m × 1 vectors.
  – So, in the previous example, a query of "Child Proofing" is represented as …?

• Query matching in the vector space model can be viewed as a search in the column space of Matrix A (i.e. the subspace spanned by the document vectors) for the documents most similar to the query.

Page 10: Simple Query Matching

• One of the most commonly used similarity measures is to
  – find the cosine of the angle between the query vector and all document vectors (assume a_j is the jth document vector):

    $\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}, \qquad j = 1, 2, \ldots, n$

  – Given a threshold value, T, documents that satisfy the condition |cos θ_j| ≥ T are judged as relevant to the user's query and returned.
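A minimal sketch of this matching step, assuming an illustrative matrix A and query q (the data are invented, not the book's example):

```python
import numpy as np

def cosine_scores(A, q):
    """cos(theta_j) = (a_j^T q) / (||a_j||_2 ||q||_2) for every column a_j of A."""
    doc_norms = np.linalg.norm(A, axis=0)   # can be precomputed and stored
    return (A.T @ q) / (doc_norms * np.linalg.norm(q))

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])              # toy term-by-document matrix
q = np.array([1.0, 0.0, 1.0])                # toy m x 1 query vector

scores = cosine_scores(A, q)
T = 0.5
relevant = np.where(np.abs(scores) >= T)[0]  # documents judged relevant to the query
print(scores, relevant)
```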

Page 11: Simple Query Matching

• Properties
  – Since A is sparse, dot product computation is inexpensive.
  – Document vector norms can be pre-computed and stored before any cosine computation.
    • No need to do that if the matrix A is normalized.

– If both the query and document vectors are normalized, the cosine computation constitutes a single inner product.

Page 12: Simple Query Matching: Example

• For our previous query "Child Proofing", and assuming that T = 0.5:
  – The nonzero cosines are cos θ_2 = cos θ_3 = 0.4082, and cos θ_5 = cos θ_6 = 0.5.
  – Therefore, the only "relevant" documents returned are
    • Baby Proofing Basics
    • Your Guide to Easy Rust Proofing
  – Documents 1 to 4 have been incorrectly ignored, whereas Document 7 has been correctly ignored.

Page 13: Simple Query Matching: Example

• What about a query on Child Home Safety?
  – What is the query?
  – What are the nonzero cosines?
  – What is retrieved with T = 0.5?

Page 14: Simple Query Matching

• Hence, the current vector space model representation and query technique do not accurately represent and/or retrieve semantic content of the book titles.

• The following approaches have been developed to address errors in this model:
  – Term weighting
  – Low-rank approximations to the original term-by-document matrix A

Page 15: Term Weighting

• The main objective of term weighting is to improve retrieval performance.

• Term-document matrix A entries a_ij are redefined as follows:

    a_ij = l_ij g_i d_j

  • l_ij is the local weight for term i occurring in document j
  • g_i is the global weight for term i in the collection
  • d_j is a document normalization factor that specifies whether or not the columns of A are normalized.

  – Define f_ij as the frequency with which term i appears in document j, and let

    $p_{ij} = \frac{f_{ij}}{\sum_{j} f_{ij}}, \qquad \chi(r) = \begin{cases} 1 & \text{if } r > 0 \\ 0 & \text{if } r = 0 \end{cases}$
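The two quantities defined above can be computed directly; a minimal sketch, assuming a small made-up frequency matrix F:

```python
import numpy as np

F = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0]])   # toy raw frequencies f_ij (2 terms x 3 documents)

# p_ij = f_ij / sum_j f_ij : term i's occurrences in document j as a fraction
# of that term's occurrences over the whole collection
P = F / F.sum(axis=1, keepdims=True)

# chi(r) = 1 if r > 0, 0 if r = 0
chi = (F > 0).astype(float)

print(P)
print(chi)
```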

Page 16: Term Weighting

Page 17: Term Weighting

• A simple notation for specifying a term weighting approach is to use the three-letter string associated with the particular local, global, and normalization factor symbols.
  – For example, the lfc weighting scheme defines one such choice of local, global, and normalization factors (see the sketch below).
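A sketch of one common reading of an lfc-style scheme (logarithmic local weight, IDF-style global weight, cosine normalization of the columns); the component formulas below are common choices assumed here, not necessarily the book's exact definitions:

```python
import numpy as np

def lfc_weights(F):
    """Sketch of an lfc-style scheme: log local weight, IDF global weight,
    cosine normalization. Component formulas are assumptions for illustration."""
    m, n = F.shape
    L = np.where(F > 0, np.log2(1.0 + F), 0.0)   # local weight l_ij
    df = np.maximum((F > 0).sum(axis=1), 1)      # document frequency of term i
    g = np.log2(n / df)                          # global (inverse document frequency) weight g_i
    A = L * g[:, None]                           # a_ij = l_ij * g_i
    norms = np.linalg.norm(A, axis=0)
    norms = np.where(norms == 0, 1.0, norms)
    return A / norms                             # d_j: normalize each document vector

F = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])                  # toy raw frequencies
print(lfc_weights(F))
```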

Page 18: Term Weighting

• Defining an appropriate weighting scheme depends on certain characteristics of the document collection.
  – The choice for the local weight (l_ij) may depend on the vocabulary or word usage patterns for the collection.
    • For technical/scientific vocabularies (technical reports, journal articles), schemes of the form nxx are recommended.
    • For more general/varied vocabularies (e.g. popular magazines, encyclopedias), simple term frequencies (t**) may be sufficient.
    • Binary term frequencies (b**) are useful when the term list is relatively short (e.g. controlled vocabularies).

Page 19: Term Weighting

• Defining an appropriate weighting scheme depends on certain characteristics of the document collection.
  – The choice for the global weight (g_i) should take into account how often the collection is likely to change, called the state of the document collection.

• For dynamic collections, one may disregard the global factor altogether (*x*)

• For static collections, the inverse document frequency (IDF) global weight (*f*) is a common choice among automatic indexing schemes.

– The probability of a document being judged relevant by a user increases significantly with the document length, i.e. the longer the document, the more likely it is that all keywords will be found.

• Traditional cosine normalization (**c) has not been effective for large full text documents (e.g. TREC-4).

• Instead, a pivoted-cosine normalization scheme has been proposed for indexing the TREC collections.

Page 20: Term Weighting: Pivoted [Cosine] Normalization

• The normalization factor for documents for which P_retrieval > P_relevance is increased, whereas the normalization factor for documents for which P_retrieval < P_relevance is decreased:

    pivoted normalization = (1 – slope) × pivot + slope × old normalization

• If the deviation of the retrieval pattern from the relevance pattern is systematic across collections for a normalization function, the pivot and slope values learned from one collection can be used effectively on another collection.
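The slide's formula translated directly into code; the pivot and slope values in the usage line are illustrative only:

```python
def pivoted_normalization(old_norm, pivot, slope):
    # pivoted normalization = (1 - slope) * pivot + slope * old normalization
    return (1.0 - slope) * pivot + slope * old_norm

# Illustrative values only: the pivot is often set near the average
# old-normalization factor of the collection, and the slope is tuned/learned.
print(pivoted_normalization(old_norm=2.0, pivot=1.5, slope=0.7))
```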

Page 21: Sparse Matrix Storage

• Although A is sparse, it does not generally exhibit a structure or pattern, such as that of banded matrices.
  – This implies that it is difficult to identify clusters of documents sharing similar terms.
    • Some progress in reordering hypertext-based matrices has been reported.

Page 22: Sparse Matrix Storage

• Two formats suitable for term-by-document matrices are Compressed Row Storage (CRS) and Compressed Column Storage (CCS).
  – They make no assumptions about the existence of a pattern or structure in the sparse matrix.
  – Each requires 3 arrays of storage.

Page 23: Compressed Row Storage

• One floating-point array (val) for storing the nonzero values, i.e. [un]weighted term frequencies, of A.
  – Stored row-wise.
• Two integer arrays for indices (col_index, row_ptr).
  – col_index: the corresponding column indices of the elements in the val array (what is the size of this array?).
    • If val(k) = a_ij, then col_index(k) = j.
  – row_ptr: the locations in the val array that begin a row.

Page 24: Compressed Row Storage: Example

(The slide shows the val, col_index, and row_ptr arrays for the normalized term-by-document matrix from the earlier example; its nonzero entries, e.g. 0.58, 0.45, and 0.7, are stored row by row.)
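A minimal sketch of both storage formats using scipy.sparse, which exposes exactly these three arrays; the small matrix is illustrative and is not the slide's example:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

# An illustrative dense matrix (not the slide's example collection).
A = np.array([[0.58, 0.0,  0.45],
              [0.0,  0.7,  0.0 ],
              [0.58, 0.0,  0.45]])

A_crs = csr_matrix(A)          # Compressed Row Storage
print(A_crs.data)              # val: nonzero values stored row-wise
print(A_crs.indices)           # col_index: column index of each stored value
print(A_crs.indptr)            # row_ptr: where each row starts in val

A_ccs = csc_matrix(A)          # Compressed Column Storage (Harwell-Boeing)
print(A_ccs.data, A_ccs.indices, A_ccs.indptr)   # val, row_index, col_ptr
```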

Page 25: Compressed Column Storage

• Also known as the Harwell-Boeing sparse matrix format.
• Almost identical to CRS.
  – Columns are stored in contiguous array locations.

Page 26: Compressed Column Storage: Example

(The slide shows the val, row_index, and col_ptr arrays for the same normalized term-by-document matrix; its nonzero entries are stored column by column.)

Page 27: Low-Rank Approximations

• The uncertainties associated with term-by-document matrices can be attributed to differences in language (word usage) and culture.
  – For example, the author may use different words than the reader/searcher.
    • Will we ever have a perfect term-by-document matrix representing all possible term-document associations?

– Errors in measurement can accumulate and lead to those uncertainties.

Page 28: Low-Rank Approximations

• Hence, the term-by-document matrix may be represented by the matrix sum A + E, where E reflects the error or uncertainty in assigning (or generating) the elements of matrix A.

• Current approaches to information retrieval that do not require literal word matches have focused on the use of rank-k approximations to term-by-document matrices.
  – Latent Semantic Indexing (k << min(m, n))

• Coordinates produced by low-rank approximations do not explicitly reflect term frequencies within documents; instead, they model global usage patterns of terms so that related documents are represented by nearby vectors in the k-dimensional space.

– Semi-discrete Decomposition
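A minimal sketch of forming a rank-k approximation with the truncated SVD, the decomposition used by Latent Semantic Indexing; the matrix and the choice k = 2 are illustrative:

```python
import numpy as np

# Rank-k approximation of a term-by-document matrix via the truncated SVD,
# the decomposition underlying Latent Semantic Indexing.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])           # toy m x n matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # in practice k << min(m, n)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

# Documents (and queries projected into the same space) are compared using
# their coordinates in the reduced k-dimensional space, e.g. the columns of
# np.diag(s[:k]) @ Vt[:k, :].
print(np.linalg.matrix_rank(A_k), np.linalg.norm(A - A_k))
```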

Page 29: Low-Rank Approximations