
Page 1: Information Retrieval in Text Part II

• Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999.

• Reading Assignment: Chapter 3.

Page 2: Traditional Information Retrieval vs. Vector Space Models

• Traditional lexical (or Boolean) IR techniques, e.g. keyword matching, often return too much and/or irrelevant information to the user due to
  • Synonymy (multiple ways to express a concept), and
  • Polysemy (multiple meanings of a word).
  – If the terms used in the query differ from the terms used in the document, valuable information can never be found.

• Vector space models can be used to encode/represent terms and documents in a text collection
  – Each component of a document vector can be used to represent a particular term/keyword/phrase/concept
  – The value assigned reflects the semantic importance of the term/keyword/phrase/concept

Page 3: Vector Space Construction

• A document collection of n documents indexed using m terms is represented as an m × n term-by-document matrix A.
  – a_ij is the weighted frequency at which term i occurs in document j.
  – Columns are document vectors and rows are term vectors.
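As an illustration of this construction, here is a minimal sketch that builds a term-by-document matrix for a made-up three-document collection (the documents and terms are invented for illustration and are not the book's example):

```python
import numpy as np

# A made-up toy collection: three short "documents".
docs = [
    "baby safety at home",
    "home safety guide",
    "guide to baby care",
]

# Rows are terms, columns are documents; a_ij is the raw frequency
# of term i in document j.
terms = sorted({word for d in docs for word in d.split()})
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for word in d.split():
        A[terms.index(word), j] += 1

print(terms)  # row labels (terms)
print(A)      # columns are document vectors, rows are term vectors
```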

Page 4: Vector Space Properties

• The document vectors span the content, yet it is not the case that every vector represented in the column space of A has a specific interpretation.

– For example, a linear combination of any two document vectors does not necessarily represent a viable document of the collection

• However, the vector space model can exploit geometric relationships between document (and term) vectors in order to explain both similarities and differences in concepts.

Page 5: Term-by-Document Matrices

• What is the relationship between the number of terms, m, and the number of documents, n, in
  – Heterogeneous text collections such as newspapers and encyclopedias?
  – The WWW?
    • Roughly 300,000 × 300,000,000 (as of the late 90's)

Page 6: Term-by-Document Matrices: Example

Page 7: Term-by-Document Matrices: Example

Page 8: Term-by-Document Matrices: Example

• Although the frequency of each term here is 1, it can be higher.
  – Matrix entries can be scaled so that the Euclidean norm of each document vector is equal to 1:

    $\|x\|_2 = (x^T x)^{1/2} = \left( \sum_{i=1}^{m} x_i^2 \right)^{1/2}$

• Determining which words to index and which words to discard defines both the art and the science of automated indexing.

• Terms are usually identified by their word stems.
  – Stemming reduces the number of rows in the term-by-document matrix, which may lead to storage savings.
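A minimal sketch of that scaling step, applied column by column to an illustrative matrix (the entries are made up):

```python
import numpy as np

def normalize_columns(A):
    # ||x||_2 = (x^T x)^(1/2) = (sum_i x_i^2)^(1/2), computed per column
    norms = np.linalg.norm(A, axis=0)
    norms = np.where(norms == 0, 1.0, norms)   # guard against all-zero documents
    return A / norms

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])                # illustrative raw-frequency matrix
A_hat = normalize_columns(A)
print(np.linalg.norm(A_hat, axis=0))           # every document vector now has norm 1
```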

Page 9: Simple Query Matching

• Queries are represented as m × 1 vectors.
  – So, in the previous example, a query of "Child Proofing" is represented as …?

• Query matching in the vector space model can be viewed as a search in the column space of Matrix A (i.e. the subspace spanned by the document vectors) for the documents most similar to the query.

Page 10: Simple Query Matching

• One of the most commonly used similarity measures is to
  – find the cosine of the angle between the query vector and all document vectors (assume a_j is the jth document vector):

    $\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}, \qquad j = 1, 2, \ldots, n$

  – Given a threshold value, T, documents that satisfy the condition |cos θ_j| ≥ T are judged as relevant to the user's query and returned.
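A minimal sketch of this matching step, assuming an illustrative matrix A and query q (the data are invented, not the book's example):

```python
import numpy as np

def cosine_scores(A, q):
    """cos(theta_j) = (a_j^T q) / (||a_j||_2 ||q||_2) for every column a_j of A."""
    doc_norms = np.linalg.norm(A, axis=0)   # can be precomputed and stored
    return (A.T @ q) / (doc_norms * np.linalg.norm(q))

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])              # toy term-by-document matrix
q = np.array([1.0, 0.0, 1.0])                # toy m x 1 query vector

scores = cosine_scores(A, q)
T = 0.5
relevant = np.where(np.abs(scores) >= T)[0]  # documents judged relevant to the query
print(scores, relevant)
```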

Page 11: Simple Query Matching

• Properties
  – Since A is sparse, dot product computation is inexpensive.
  – Document vector norms can be pre-computed and stored before any cosine computation.
    • No need to do that if the matrix A is normalized.

– If both the query and document vectors are normalized, the cosine computation constitutes a single inner product.

Page 12: Simple Query Matching: Example

• For our previous query "Child Proofing", and assuming that T = 0.5:
  – The nonzero cosines are cos θ_2 = cos θ_3 = 0.4082, and cos θ_5 = cos θ_6 = 0.5.
  – Therefore, the only "relevant" documents returned are
    • Baby Proofing Basics
    • Your Guide to Easy Rust Proofing
  – Documents 1 to 4 have been incorrectly ignored, whereas Document 7 has been correctly ignored.

Page 13: Simple Query Matching: Example

• What about a query on Child Home Safety?
  – What is the query?
  – What are the nonzero cosines?
  – What is retrieved with T = 0.5?

Page 14: Simple Query Matching

• Hence, the current vector space model representation and query technique do not accurately represent and/or retrieve semantic content of the book titles.

• The following approaches have been developed to address errors in this model:
  – Term weighting
  – Low-rank approximations to the original term-by-document matrix A

Page 15: Term Weighting

• The main objective of term weighting is to improve retrieval performance.

• Term-document matrix A entries a_ij are redefined as follows:

    a_ij = l_ij g_i d_j

  • l_ij is the local weight for term i occurring in document j
  • g_i is the global weight for term i in the collection
  • d_j is a document normalization factor that specifies whether or not the columns of A are normalized.

  – Define f_ij as the frequency with which term i appears in document j, and let

    $p_{ij} = \frac{f_{ij}}{\sum_{j} f_{ij}}, \qquad \chi(r) = \begin{cases} 1 & \text{if } r > 0 \\ 0 & \text{if } r = 0 \end{cases}$
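The two quantities defined above can be computed directly; a minimal sketch, assuming a small made-up frequency matrix F:

```python
import numpy as np

F = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0]])   # toy raw frequencies f_ij (2 terms x 3 documents)

# p_ij = f_ij / sum_j f_ij : term i's occurrences in document j as a fraction
# of that term's occurrences over the whole collection
P = F / F.sum(axis=1, keepdims=True)

# chi(r) = 1 if r > 0, 0 if r = 0
chi = (F > 0).astype(float)

print(P)
print(chi)
```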

Page 16: Term Weighting

Page 17: Term Weighting

• A simple notation for specifying a term weighting approach is to use the three-letter string associated with the particular local, global, and normalization factor symbols.
  – For example, the lfc weighting scheme defines one such choice of local, global, and normalization factors (see the sketch below).
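A sketch of one common reading of an lfc-style scheme (logarithmic local weight, IDF-style global weight, cosine normalization of the columns); the component formulas below are common choices assumed here, not necessarily the book's exact definitions:

```python
import numpy as np

def lfc_weights(F):
    """Sketch of an lfc-style scheme: log local weight, IDF global weight,
    cosine normalization. Component formulas are assumptions for illustration."""
    m, n = F.shape
    L = np.where(F > 0, np.log2(1.0 + F), 0.0)   # local weight l_ij
    df = np.maximum((F > 0).sum(axis=1), 1)      # document frequency of term i
    g = np.log2(n / df)                          # global (inverse document frequency) weight g_i
    A = L * g[:, None]                           # a_ij = l_ij * g_i
    norms = np.linalg.norm(A, axis=0)
    norms = np.where(norms == 0, 1.0, norms)
    return A / norms                             # d_j: normalize each document vector

F = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])                  # toy raw frequencies
print(lfc_weights(F))
```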

Page 18: Term Weighting

• Defining an appropriate weighting scheme depends on certain characteristics of the document collection.
  – The choice for the local weight (l_ij) may depend on the vocabulary or word usage patterns for the collection.
    • For technical/scientific vocabularies (technical reports, journal articles), schemes of the form nxx are recommended.
    • For more general/varied vocabularies (e.g. popular magazines, encyclopedias), simple term frequencies (t**) may be sufficient.
    • Binary term frequencies (b**) are useful when the term list is relatively short (e.g. controlled vocabularies).

Page 19: Term Weighting

• Defining an appropriate weighting scheme depends on certain characteristics of the document collection.
  – The choice for the global weight (g_i) should take into account how often the collection is likely to change, called the state of the document collection.

• For dynamic collections, one may disregard the global factor altogether (*x*)

• For static collections, the inverse document frequency (IDF) global weight (*f*) is a common choice among automatic indexing schemes.

– The probability of a document being judged relevant by a user increases significantly with the document length, i.e. the longer the document, the more likely it is that all keywords will be found.

• Traditional cosine normalization (**c) has not been effective for large full text documents (e.g. TREC-4).

• Instead, a pivoted-cosine normalization scheme has been proposed for indexing the TREC collections.

Page 20: Term Weighting: Pivoted [Cosine] Normalization

• The normalization factor for documents for which P_retrieval > P_relevance is increased, whereas the normalization factor for documents for which P_retrieval < P_relevance is decreased:

    pivoted normalization = (1 – slope) × pivot + slope × old normalization

• If the deviation of the retrieval pattern from the relevance pattern is systematic across collections for a normalization function, the pivot and slope values learned from one collection can be used effectively on another collection.
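The slide's formula translated directly into code; the pivot and slope values in the usage line are illustrative only:

```python
def pivoted_normalization(old_norm, pivot, slope):
    # pivoted normalization = (1 - slope) * pivot + slope * old normalization
    return (1.0 - slope) * pivot + slope * old_norm

# Illustrative values only: the pivot is often set near the average
# old-normalization factor of the collection, and the slope is tuned/learned.
print(pivoted_normalization(old_norm=2.0, pivot=1.5, slope=0.7))
```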

Page 21: Sparse Matrix Storage

• Although A is sparse, it does not generally exhibit a structure or pattern, such as that of banded matrices.
  – This implies that it is difficult to identify clusters of documents sharing similar terms.
    • Some progress in reordering hypertext-based matrices has been reported.

Page 22: Sparse Matrix Storage

• Two formats suitable for term-by-document matrices are Compressed Row Storage (CRS) and Compressed Column Storage (CCS).
  – They make no assumptions about the existence of a pattern or structure in the sparse matrix.
  – Each requires 3 arrays of storage.

Page 23: Compressed Row Storage

• One floating-point array (val) for storing the nonzero values, i.e. [un]weighted term frequencies, of A.
  – Stored row-wise.
• Two integer arrays for indices (col_index, row_ptr).
  – col_index: the corresponding column indices of the elements in the val array (what is the size of this array?).
    • If val(k) = a_ij, then col_index(k) = j.
  – row_ptr: the locations in the val array that begin a row.

Page 24: Compressed Row Storage: Example

(The slide shows the val, col_index, and row_ptr arrays for the normalized term-by-document matrix from the earlier example; its nonzero entries, e.g. 0.58, 0.45, and 0.7, are stored row by row.)
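A minimal sketch of both storage formats using scipy.sparse, which exposes exactly these three arrays; the small matrix is illustrative and is not the slide's example:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

# An illustrative dense matrix (not the slide's example collection).
A = np.array([[0.58, 0.0,  0.45],
              [0.0,  0.7,  0.0 ],
              [0.58, 0.0,  0.45]])

A_crs = csr_matrix(A)          # Compressed Row Storage
print(A_crs.data)              # val: nonzero values stored row-wise
print(A_crs.indices)           # col_index: column index of each stored value
print(A_crs.indptr)            # row_ptr: where each row starts in val

A_ccs = csc_matrix(A)          # Compressed Column Storage (Harwell-Boeing)
print(A_ccs.data, A_ccs.indices, A_ccs.indptr)   # val, row_index, col_ptr
```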

Page 25: Compressed Column Storage

• Also known as the Harwell-Boeing sparse matrix format.
• Almost identical to CRS.
  – Columns are stored in contiguous array locations.

Page 26: Compressed Column Storage: Example

(The slide shows the val, row_index, and col_ptr arrays for the same normalized term-by-document matrix; its nonzero entries are stored column by column.)

Page 27: Low-Rank Approximations

• The uncertainties associated with term-by-document matrices can be attributed to differences in language (word usage) and culture.
  – For example, the author may use different words than the reader/searcher.
    • Will we ever have a perfect term-by-document matrix representing all possible term-document associations?

– Errors in measurement can accumulate and lead to those uncertainties.

Page 28: Low-Rank Approximations

• Hence, the term-by-document matrix may be represented by the matrix sum A + E, where E reflects the error or uncertainty in assigning (or generating) the elements of matrix A.

• Current approaches to information retrieval that do not require literal word matches have focused on the use of rank-k approximations to term-by-document matrices.
  – Latent Semantic Indexing (k << min(m, n))

• Coordinates produced by low-rank approximations do not explicitly reflect term frequencies within documents; instead, they model global usage patterns of terms so that related documents are represented by nearby vectors in the k-dimensional space.

– Semi-discrete Decomposition
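A minimal sketch of forming a rank-k approximation with the truncated SVD, the decomposition used by Latent Semantic Indexing; the matrix and the choice k = 2 are illustrative:

```python
import numpy as np

# Rank-k approximation of a term-by-document matrix via the truncated SVD,
# the decomposition underlying Latent Semantic Indexing.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])           # toy m x n matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # in practice k << min(m, n)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

# Documents (and queries projected into the same space) are compared using
# their coordinates in the reduced k-dimensional space, e.g. the columns of
# np.diag(s[:k]) @ Vt[:k, :].
print(np.linalg.matrix_rank(A_k), np.linalg.norm(A - A_k))
```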

Page 29: Low-Rank Approximations