Information Retrieval: Retrieval Models...Retrieval Models: Unranked Boolean WestLaw system:...

CS-490W: Web Information Search & Management

Information Retrieval: Retrieval Models

Luo Si
Department of Computer Science
Purdue University

Retrieval Models

[Figure: the retrieval process, with components Information Need, Representation, Query, Indexed Objects, Retrieval Model, Retrieved Objects, and Evaluation/Feedback]

Overview of Retrieval Models

Retrieval Models
- Boolean
- Vector space
  - Basic vector space (SMART, Lucene)
  - Extended Boolean
- Probabilistic models
  - Statistical language models (Lemur)
  - Two-Poisson model (Okapi)
  - Bayesian inference networks (Inquery)
- Citation/Link analysis models
  - PageRank (Google)
  - Hubs & authorities (Clever)

Retrieval Models: Outline

Retrieval Models
- Exact-match retrieval method
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval method
  - Vector space retrieval method
  - Latent semantic indexing

Retrieval Models: Unranked Boolean

Unranked Boolean: exact-match method
- Selection model
  - Retrieve a document iff it matches the precise query
  - Often returns unranked documents (or documents in chronological order)
- Operators
  - Logical operators: AND, OR, NOT
  - Proximity operators: #1(white house) (i.e., within one word distance, a phrase); #sen(Iraq weapon) (i.e., within a sentence)
  - String matching operators: wildcard (e.g., ind* for india and indonesia)
  - Field operators: title(information and retrieval) …
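
The logical, phrase, and wildcard operators above can be sketched with set operations over a small inverted index. This is a minimal illustration, not any particular system's implementation: the three documents and the helper names (`build_index`, `term`, `phrase`, `wildcard`) are assumptions made up for the example.

```python
# Toy collection; the document texts are made up for illustration.
docs = {
    1: "the white house issued a statement",
    2: "india and indonesia signed a trade agreement",
    3: "a white picket fence around the house",
}

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

index = build_index(docs)
all_ids = set(docs)

def term(t):
    """Posting set of a single term (empty if the term is unknown)."""
    return index.get(t, set())

def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def NOT(a):
    return all_ids - a

def wildcard(prefix):
    """Prefix wildcard such as ind*: union of the postings of matching terms."""
    result = set()
    for t, postings in index.items():
        if t.startswith(prefix):
            result |= postings
    return result

def phrase(first, second):
    """#1(first second): `second` occurs immediately after `first`."""
    hits = set()
    for doc_id in term(first) & term(second):
        words = docs[doc_id].split()
        if any(a == first and b == second for a, b in zip(words, words[1:])):
            hits.add(doc_id)
    return hits

print(sorted(AND(term("white"), term("house"))))        # [1, 3]
print(sorted(phrase("white", "house")))                 # [1]
print(sorted(wildcard("ind")))                          # [2]
print(sorted(AND(term("house"), NOT(term("fence")))))   # [1]
```

A document is either in the result set or not; nothing in this sketch orders the results, which is the "unranked" part of the model.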

Retrieval Models: Unranked Boolean

Unranked Boolean: exact-match method
A query example:
(#2(distributed information retrieval) OR #1(federated search)) AND author(#1(Jamie Callan) AND NOT (Steve))

Retrieval Models: Unranked Boolean

WestLaw system: commercial legal/health/finance information retrieval system
- Logical operators
- Proximity operators: phrase, word proximity, same sentence/paragraph
- String matching operator: wildcard (e.g., ind*)
- Field operator: title(#1(“legal retrieval”)) date(2000)
- Citations: Cite (Salton)

Retrieval Models: Unranked Boolean

Advantages:
- Works well if the user knows exactly what to retrieve
- Predictable; easy to explain
- Very efficient

Disadvantages:
- It is difficult to design the query: a loose query gives high recall but low precision; a strict query gives low recall but high precision
- Results are unordered; it is hard to find the useful ones
- Users may be too optimistic about strict queries: they return a few very relevant documents, but many more are missed

Retrieval Models: Ranked Boolean

Ranked Boolean: exact match
- Similar to unranked Boolean, but documents are ordered by some criterion
- Reflect the importance of a document by its words

Query: (Thailand AND stock AND market)
Retrieve docs from the Wall Street Journal collection

Which word is more important?
- Many “stock” and “market”, but fewer “Thailand”. Fewer may be more indicative
- Term Frequency (TF): number of occurrences in the query/doc; a larger number means more important
- Inverse Document Frequency (IDF): (total number of docs) / (number of docs that contain the term); larger means more important
- There are many variants of TF and IDF: e.g., consider document length
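
As a rough illustration of TF and IDF, the sketch below computes both on a made-up four-document collection (a stand-in for the Wall Street Journal data, which is not reproduced here). It uses the log form of IDF, log(N/df), which is the "t" component that appears later in the vector space slides; the function names are illustrative only.

```python
import math

# Made-up mini collection standing in for the Wall Street Journal documents.
docs = {
    "d1": "stock market rally lifts stock prices",
    "d2": "thailand stock market falls",
    "d3": "bond market steady",
    "d4": "stock prices and market news",
}

def tf(term, text):
    """Raw term frequency: number of occurrences of the term in the text."""
    return text.split().count(term)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / df)

for t in ("thailand", "stock", "market"):
    print(t, "tf in d2 =", tf(t, docs["d2"]), "idf =", round(idf(t, docs), 3))
```

In this toy collection "market" occurs in every document, so its IDF is zero, while the rarer "thailand" gets the largest IDF, matching the intuition that fewer occurrences may be more indicative.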

Retrieval Models: Ranked Boolean

Ranked Boolean: calculate doc score
- Term evidence: evidence from term i occurring in doc j: (tf_ij) and (tf_ij * idf_i)
- AND weight: minimum of argument weights
- OR weight: maximum of argument weights

Example, for the query (Thailand AND stock AND market) with term evidence 0.2, 0.6, 0.4:
- AND: min = 0.2
- OR: max = 0.6
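
The min/max scoring rule can be sketched as a small recursive evaluator. The nested-tuple query encoding, the `score` and `rank` helpers, and the evidence values for documents d2 and d3 are assumptions for illustration; only the weights 0.2, 0.6, 0.4 come from the slide.

```python
def score(query, evidence):
    """Score a nested query such as ("AND", "a", ("OR", "b", "c")).

    `evidence` maps a term to its weight in one document; absent terms count as 0.
    """
    if isinstance(query, str):
        return evidence.get(query, 0.0)
    op, *args = query
    weights = [score(arg, evidence) for arg in args]
    if op == "AND":
        return min(weights)
    if op == "OR":
        return max(weights)
    raise ValueError(f"unknown operator: {op}")

def rank(query, collection):
    """Keep only matching documents (score > 0), ordered by descending score."""
    scored = {doc_id: score(query, ev) for doc_id, ev in collection.items()}
    matching = [doc_id for doc_id, s in scored.items() if s > 0]
    return sorted(matching, key=lambda doc_id: -scored[doc_id])

doc = {"thailand": 0.2, "stock": 0.6, "market": 0.4}
print(score(("AND", "thailand", "stock", "market"), doc))  # 0.2
print(score(("OR", "thailand", "stock", "market"), doc))   # 0.6

collection = {
    "d1": doc,
    "d2": {"stock": 0.9, "market": 0.8},                    # no "thailand"
    "d3": {"thailand": 0.5, "stock": 0.3, "market": 0.7},
}
print(rank(("AND", "thailand", "stock", "market"), collection))  # ['d3', 'd1']
```

Document d2 is dropped entirely because the AND score is zero, which shows that ranked Boolean is still an exact-match model: ranking only orders the documents that satisfy the query.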

Retrieval Models: Ranked Boolean

Advantages:
- All advantages of the unranked Boolean algorithm: works well when the query is precise; predictable; efficient
- Results are in a ranked list (not an unordered full list); easier to browse and find the most relevant ones than with unranked Boolean
- Rank criterion is flexible: e.g., different variants of term evidence

Disadvantages:
- Still an exact-match (document selection) model: inverse correlation between recall and precision for strict and loose queries
- Predictability makes users overestimate retrieval quality

Retrieval Models: Vector Space Model

Vector space model
- Any text object can be represented by a term vector: documents, queries, passages, sentences
- A query can be seen as a short document
- Similarity is determined by distance in the vector space; example: cosine of the angle between two vectors

The SMART system
- Developed at Cornell University: 1960-1999
- Still quite popular

The Lucene system
- Open-source information retrieval library (based on Java)
- Works with Hadoop (Map/Reduce) in large-scale applications (e.g., Amazon Book)

Retrieval Models: Vector Space Model

Vector space model vs. Boolean model
- Boolean models
  - Query: a Boolean expression that a document must satisfy
  - Retrieval: deductive inference
- Vector space model
  - Query: viewed as a short document in a vector space
  - Retrieval: find similar vectors/objects

Retrieval Models: Vector Space Model

Vector representation

[Figure: documents D1, D2, D3 and the query drawn as vectors in a term space with axes Java, Sun, and Starbucks]

Retrieval Models: Vector Space Model

Given two vectors, the query and the document, calculate the similarity:

query: $q = (q_1, q_2, \ldots, q_n)$
document: $d_j = (d_{j,1}, d_{j,2}, \ldots, d_{j,n})$

Cosine similarity: angle $\theta(q, d_j)$ between the vectors

$$sim(q, d_j) = \cos(\theta(q, d_j)) = \frac{q \cdot d_j}{|q|\,|d_j|} = \frac{q_1 d_{j,1} + q_2 d_{j,2} + \cdots + q_n d_{j,n}}{\sqrt{q_1^2 + \cdots + q_n^2}\;\sqrt{d_{j,1}^2 + \cdots + d_{j,n}^2}}$$
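
Cosine similarity between a query vector and a document vector can be sketched in a few lines. The three-dimensional vectors below (axes java, sun, starbucks) use made-up weights, and returning 0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention for this sketch: empty vectors match nothing
    return dot / (norm_q * norm_d)

# Axes: (java, sun, starbucks); the weights are illustrative only.
query = [1, 0, 1]
d1 = [2, 0, 3]
d2 = [1, 3, 0]

for name, vec in (("d1", d1), ("d2", d2)):
    print(name, round(cosine(query, vec), 4))
```

Because the score depends only on the angle, scaling a document vector by a constant does not change its similarity to the query.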

Retrieval Models: Vector Space Model

Vector coefficients
- The coefficients (vector elements) represent term evidence / term importance
- Each coefficient is derived from several elements:
  - Document term weight: evidence of the term in the document/query
  - Collection term weight: importance of the term from observation of the collection
  - Length normalization: reduces document length bias
- Naming convention for coefficients: $q_k \cdot d_{j,k}$ = DCL.DCL; the first triple represents the query term, the second the document term

Retrieval Models: Vector Space Model

Common vector weight components:
lnc.ltc: widely used term weight
- “l”: log(tf) + 1
- “n”: no weight/normalization
- “t”: log(N/df)
- “c”: cosine normalization

$$\frac{q_1 d_{j,1} + q_2 d_{j,2} + \cdots + q_n d_{j,n}}{|q|\,|d_j|} = \frac{\sum_k \left(\log(tf_q(k)) + 1\right)\left(\log(tf_j(k)) + 1\right)\log\frac{N}{df(k)}}{\sqrt{\sum_k \left[\log(tf_q(k)) + 1\right]^2}\;\sqrt{\sum_k \left[\left(\log(tf_j(k)) + 1\right)\log\frac{N}{df(k)}\right]^2}}$$
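
The lnc.ltc score can be sketched as below, following the slide's convention that the first triple (lnc) applies to the query and the second (ltc) to the document. Several details are assumptions for the example: natural logarithms, a weight of 0 for absent terms, and the collection statistics (`n_docs`, `df`) and term counts, which are made up.

```python
import math

def l_weight(tf):
    """'l' component: log(tf) + 1, taken as 0 when the term is absent."""
    return math.log(tf) + 1 if tf > 0 else 0.0

def lnc_ltc(query_tf, doc_tf, df, n_docs):
    """Cosine-normalized dot product of lnc query weights and ltc document weights."""
    q = {t: l_weight(tf) for t, tf in query_tf.items()}                           # l, n
    d = {t: l_weight(tf) * math.log(n_docs / df[t]) for t, tf in doc_tf.items()}  # l, t
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))                            # c
    norm_d = math.sqrt(sum(w * w for w in d.values()))                            # c
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical collection statistics and term counts.
n_docs = 1000
df = {"thailand": 10, "stock": 200, "market": 300, "bond": 100}
query_tf = {"thailand": 1, "stock": 1, "market": 1}
doc_a = {"thailand": 3, "stock": 2, "market": 1}
doc_b = {"stock": 5, "market": 4, "bond": 2}

print(round(lnc_ltc(query_tf, doc_a, df, n_docs), 3))
print(round(lnc_ltc(query_tf, doc_b, df, n_docs), 3))
```

With these numbers doc_a scores higher than doc_b, mostly because it contains the rare term "thailand", whose log(N/df) factor is the largest.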

Retrieval Models: Vector Space Model

Common vector weight components:
dnn.dtb: handles varied document lengths
- “d”: 1 + ln(1 + ln(tf))
- “t”: log(N/df)
- “b”: 1/(0.8 + 0.2 * doclen/avg_doclen)
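
A small sketch of the three dnn.dtb document-side components, to show how the "b" factor penalizes documents that are longer than average. The function names, the choice of natural logarithm for the "t" component, and the example numbers are assumptions.

```python
import math

def d_weight(tf):
    """'d' component: 1 + ln(1 + ln(tf)), taken as 0 when the term is absent."""
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

def b_norm(doc_len, avg_doc_len):
    """'b' component: 1 / (0.8 + 0.2 * doclen / avg_doclen)."""
    return 1 / (0.8 + 0.2 * doc_len / avg_doc_len)

def dtb_weight(tf, df, n_docs, doc_len, avg_doc_len):
    """Document term weight: 'd' evidence times 't' (idf) times 'b' normalization."""
    return d_weight(tf) * math.log(n_docs / df) * b_norm(doc_len, avg_doc_len)

# Same term statistics, different document lengths (hypothetical numbers).
short_doc = dtb_weight(tf=3, df=10, n_docs=1000, doc_len=50, avg_doc_len=100)
long_doc = dtb_weight(tf=3, df=10, n_docs=1000, doc_len=400, avg_doc_len=100)
print(round(short_doc, 3), round(long_doc, 3))
```

A document of exactly average length has a "b" factor of 1, so its weight is unchanged; shorter documents are boosted slightly and longer ones are discounted.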

Retrieval Models: Vector Space Model

Standard vector space
- Represent the query/documents in a vector space
- Each dimension corresponds to a term in the vocabulary
- Use a combination of components to represent the term evidence in both query and document
- Use a similarity function to estimate the relationship between query and documents (e.g., cosine similarity)

Retrieval Models: Vector Space Model

Advantages:
- Best-match method; it does not need a precise query
- Generates ranked lists; easy to explore the results
- Simplicity: easy to implement
- Effectiveness: often works well
- Flexibility: can utilize different types of term weighting methods
- Used in a wide range of IR tasks: retrieval, classification, summarization, content-based filtering…

Retrieval Models: Vector Space Model

Disadvantages:
- Hard to choose the dimensions of the vector (“basic concepts”); terms may not be the best choice
- Assumes terms are independent of each other
- Heuristic choice of vector operations
  - Choice of term weights
  - Choice of similarity function
- Assumes a query and a document can be treated in the same way

Retrieval Models: Vector Space Model

What makes a good vector representation:
- Orthogonal: the dimensions are linearly independent (“no overlapping”)
- No ambiguity (e.g., Java)
- Wide coverage and good granularity
- Good interpretations (e.g., representation of semantic meaning)
- Many possibilities: words, stemmed words, “latent concepts”…

Retrieval Models: Latent Semantic Indexing

Dual space of terms and documents

Retrieval Models: Latent Semantic Indexing

Latent Semantic Indexing (LSI): explore the correlation between terms and documents
- Two terms are correlated (may share similar semantic concepts) if they often co-occur
- Two documents are correlated (share similar topics) if they have many common words

Latent Semantic Indexing (LSI): associate each term and document with a small number of semantic concepts/topics

Retrieval Models: Latent Semantic Indexing

Using singular value decomposition (SVD) to find the small set of concepts/topics

$$X = U S V^T, \qquad U^T U = I_m, \qquad V^T V = I_m$$

- m: number of concepts/topics
- U: representation of concepts in term space; $U^T U = I_m$
- S: diagonal matrix; concept space
- V: representation of concepts in document space; $V^T V = I_m$

Retrieval Models: Latent Semantic Indexing

Using singular value decomposition (SVD) to find the small set of concepts/topics

$$X = U S V^T, \qquad U^T U = I_m, \qquad V^T V = I_m$$

- m: number of concepts/topics
- U: representation of terms in concept space
- S: diagonal matrix; concept space
- V: representation of documents in concept space

Retrieval Models: Latent Semantic Indexing

Properties of Latent Semantic Indexing
- The diagonal elements of S, written $S_k$, are in descending order; the larger the value, the more important the concept
- $X'_k = \sum_{i \le k} u_i S_i v_i^T$ is the rank-k matrix that best approximates X, where $u_i$ and $v_i$ are the i-th column vectors of U and V

Retrieval Models: Latent Semantic Indexing

Other properties of Latent Semantic Indexing
- The columns of U are eigenvectors of $X X^T$
- The columns of V are eigenvectors of $X^T X$
- The singular values on the diagonal of S are the positive square roots of the nonzero eigenvalues of both $X X^T$ and $X^T X$
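
The SVD properties listed for LSI can be checked numerically. The sketch below assumes NumPy is available and uses a made-up 5-term by 4-document matrix; it is an illustration of the properties, not the example matrix from the lecture.

```python
import numpy as np

# Made-up term-document matrix: 5 terms (rows) x 4 documents (columns).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Thin SVD: X = U diag(s) Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation built from the k largest singular values.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Squared Frobenius error equals the sum of the squared dropped singular values.
error = np.linalg.norm(X - X_k, "fro") ** 2

# Eigenvalues of X^T X, sorted in descending order, match the squared singular values.
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

print("singular values:", np.round(s, 3))
print("squared error of rank-2 approximation:", round(float(error), 3))
print("eigenvalues of X^T X:", np.round(eigvals, 3))
```

The error line makes the "importance of concepts" reading concrete: dropping concepts with small singular values costs little, while dropping a large one would change X substantially.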

Retrieval Models: Latent Semantic Indexing

[Figures: worked example of decomposing the matrix X; the images did not survive in this transcript]

Importance of concepts
- The size of $S_k$ indicates the importance of concept k
- It also reflects the error of approximating X when the concepts with small $S_k$ are left out

Retrieval Models: Latent Semantic Indexing

SVD representation
- Reduce the high-dimensional representation of a document or query into a low-dimensional concept space
- SVD tries to preserve the Euclidean distance of document/term vectors

[Figure: two-dimensional concept space with axes Concept 1 and Concept 2]

Retrieval Models: Latent Semantic Indexing

SVD representation
Representation of the documents in two-dimensional concept space

[Figure: documents plotted against concepts C1 and C2, with labels B and C]

Retrieval Models: Latent Semantic Indexing

SVD representation
Representation of the terms in two-dimensional concept space

[Figure: terms plotted in the two-dimensional concept space, with labels B and C]

Retrieval Models: Latent Semantic Indexing

Retrieval with respect to a query
- Map (fold in) a query into the representation of the concept space: $q' = q^T U_k \, Inv(S_k)$
- Use the new representation of the query to calculate the similarity between the query and all documents (cosine similarity)
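
Query fold-in and ranking in concept space can be sketched as follows, assuming NumPy. The term-document matrix and the query are made up (they are not the lecture's "Machine Learning Protein" example), and `fold_in` and `cosine` are illustrative helper names.

```python
import numpy as np

# Made-up term-document matrix: 5 terms (rows) x 4 documents (columns).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
U_k = U[:, :k]                  # terms in concept space
S_k_inv = np.diag(1.0 / s[:k])  # Inv(S_k)
V_k = Vt[:k, :].T               # documents in concept space, one row per document

def fold_in(q):
    """q' = q^T U_k Inv(S_k): map a term-space vector into concept space."""
    return q @ U_k @ S_k_inv

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query that uses the first two terms only.
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
q_concept = fold_in(q)

scores = [cosine(q_concept, V_k[j]) for j in range(X.shape[1])]
ranking = sorted(range(X.shape[1]), key=lambda j: -scores[j])
print("query in concept space:", np.round(q_concept, 4))
print("documents ranked by cosine similarity:", ranking)
```

Folding in a column of X with the same formula reproduces that document's row of V_k, which is why the query and the documents end up comparable in the same concept space.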

Retrieval Models: Latent Semantic Indexing

Qry: Machine Learning Protein

Representation of the query in the term vector space:

$q = [0\ 0\ 1\ 1\ 0\ 1\ 0\ 0\ 0]^T$

Retrieval Models: Latent Semantic Indexing

Representation of the query in the latent semantic space (2 concepts):

$q' = q^T U_k \, Inv(S_k) = [-0.3571\ \ 0.1635]^T$

[Figure: the query plotted in the two-dimensional concept space, with labels B and C]

Retrieval Models: Latent Semantic Indexing

Comparison of Retrieval Results in term space and concept space

Qry: Machine Learning Protein

Retrieval Models: Latent Semantic Indexing

Problems with latent semantic indexing
- Difficult to decide the number of concepts
- There is no probabilistic interpretation of the results
- Obtaining the LSI model by SVD is computationally costly

Language Models: Motivation

Vector space model for information retrieval
- Documents and queries are vectors in the term space
- Relevance is measured by the similarity between document vectors and the query vector

Problems with the vector space model
- Ad-hoc term weighting schemes
- Ad-hoc similarity measurement
- No justification of the relationship between relevance and similarity

We need more principled retrieval models…

Introduction to Language Models:

A language model can be created for any language sample
- A document
- A collection of documents
- Sentence, paragraph, chapter, query…

The size of the language sample affects the quality of the language model
- Long documents have more accurate models
- Short documents have less accurate models
- A model for a sentence, paragraph, or query may not be reliable

Introduction to Language Models:

A document language model defines a probability distribution over indexed terms
- E.g., the probability of generating a term
- The sum of the probabilities is 1

A query can be seen as observed data from unknown models
- A query also defines a language model

How might the models be used for IR?
- Rank documents by $\Pr(q \mid d_i)$

Multinomial/Unigram Language Models

Language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary

Examples:
- Five words in the vocabulary: (sport, basketball, ticket, finance, stock)
- For a document $d_i$, its language model is: {Pi(“sport”), Pi(“basketball”), Pi(“ticket”), Pi(“finance”), Pi(“stock”)}

Formally:
The language model is: {Pi(w) for any word w in vocabulary V}

$$\sum_k P_i(w_k) = 1, \qquad 0 \le P_i(w_k) \le 1$$

Language Model for IR: Example

Estimating a language model for each document
- $d_1$: sport, basketball, ticket, sport → language model for $d_1$
- $d_2$: basketball, ticket, finance, ticket, sport → language model for $d_2$
- $d_3$: stock, finance, finance, stock → language model for $d_3$

Query q: sport, basketball

Estimate the generation probability $\Pr(q \mid d_i)$ for each document

Generate retrieval results

Estimating a language model for each document

$d_2$: basketball, ticket, finance, ticket, sport

Maximum Likelihood Estimation (MLE):
(psp, pb, pt, pf, pst) = (0.2, 0.2, 0.4, 0.2, 0)

For query “basketball ticket”: $\Pr(q \mid d_2)$ = ?
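
The example can be worked through with a short sketch that estimates each document's unigram model by maximum likelihood and scores a query by the product of its term probabilities. The helper names are illustrative, and the sketch applies no smoothing, so any document missing a query term gets probability zero.

```python
from collections import Counter

VOCAB = ["sport", "basketball", "ticket", "finance", "stock"]

docs = {
    "d1": "sport basketball ticket sport".split(),
    "d2": "basketball ticket finance ticket sport".split(),
    "d3": "stock finance finance stock".split(),
}

def mle_model(words):
    """Unigram model by maximum likelihood: count(w) / document length."""
    counts = Counter(words)
    return {w: counts[w] / len(words) for w in VOCAB}

def query_likelihood(query, model):
    """Pr(q | d) under a unigram model: product of the per-term probabilities."""
    p = 1.0
    for w in query:
        p *= model[w]
    return p

models = {name: mle_model(words) for name, words in docs.items()}

print(models["d2"])
# For the query "basketball ticket" on d2 this is 0.2 * 0.4, i.e. about 0.08.
print(query_likelihood(["basketball", "ticket"], models["d2"]))

# Rank all documents for the query "sport basketball".
q = ["sport", "basketball"]
ranking = sorted(docs, key=lambda name: -query_likelihood(q, models[name]))
print(ranking)  # ['d1', 'd2', 'd3']
```

Under these estimates d3 scores exactly zero for "sport basketball" because neither query term occurs in it.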

Retrieval Models: Outline

Retrieval Models
- Exact-match retrieval method
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval method
  - Vector space retrieval method
  - Latent semantic indexing
  - Language modeling approach