Language Models for Information Retrieval
References:
1. W. B. Croft and J. Lafferty (Editors). Language Modeling for Information Retrieval. July 2003
2. T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, January–February 2001
3. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. (Chapter 12)
4. D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Springer, 2004. (Chapter 2)
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
IR – Berlin Chen 2
Taxonomy of Classic IR Models
User task:
– Retrieval: Ad hoc, Filtering
– Browsing

Models:
– Classic models: Boolean, Vector, Probabilistic
– Set-theoretic models: Fuzzy, Extended Boolean
– Algebraic models: Generalized Vector, Neural Networks
– Probabilistic models: Inference Network, Belief Network, Language Model, Probabilistic LSI, Topical Mixture Model
– Structured models: Non-Overlapping Lists, Proximal Nodes
– Browsing: Flat, Structure Guided, Hypertext
Statistical Language Models (1/2)
• A probabilistic mechanism for "generating" a piece of text
– Defines a distribution over all possible word sequences
• What is an LM used for?
– Speech recognition
– Spelling correction
– Handwriting recognition
– Optical character recognition
– Machine translation
– Document classification and routing
– Information retrieval …
$$W = w_1 w_2 \cdots w_N$$
$$P(W) = \,?$$
Statistical Language Models (2/2)
• (Statistical) language models (LM) have been widely used for speech recognition and language (machine) translation for more than twenty years
• However, their use for information retrieval started only in 1998 [Ponte and Croft, SIGIR 1998]
Query Likelihood Language Models
• Documents are ranked based on Bayes (decision) rule

$$P(D \mid Q) = \frac{P(Q \mid D)\,P(D)}{P(Q)}$$

– $P(Q)$ is the same for all documents, and can be ignored
– $P(D)$ might have to do with authority, length, genre, etc.
  • There is no general way to estimate it
  • Can be treated as uniform across all documents
• Documents can therefore be ranked based on $P(Q \mid D)$, also denoted $P(Q \mid M_D)$, where $M_D$ is the language model of document $D$
– The user has a prototype (ideal) document in mind, and generates a query based on words that appear in this document
– A document is treated as a model to predict (generate) the query
Schematic Depiction
(Schematic: each document $D_1, D_2, D_3, \dots$ in the document collection induces a document model $M_{D_1}, M_{D_2}, M_{D_3}, \dots$; a query $Q$ is scored against each model by the likelihood $P(Q \mid M_{D_i})$.)
n-grams• Multiplication (Chain) rule
– Decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events
• n-gram assumption– Unigram
• Each word occurs independently of the other words• The so-called “bag-of-words” model
– Bigram
– Most language-modeling work in IR has used unigram language models
• IR does not directly depend on the structure of sentences
Chain rule:
$$P(w_1 w_2 \cdots w_N) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_1 w_2 \cdots w_{N-1})$$

Unigram:
$$P(w_1 w_2 \cdots w_N) = P(w_1)\,P(w_2)\,P(w_3) \cdots P(w_N)$$

Bigram:
$$P(w_1 w_2 \cdots w_N) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$$
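The chain rule with the unigram and bigram assumptions can be sketched with maximum-likelihood counts; the toy corpus and function names below are illustrative, not from the slides:

```python
from collections import Counter

# Hypothetical toy corpus (9 tokens); names here are my own
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    # MLE unigram probability: count(w) / corpus size
    return unigrams[w] / total

def p_bigram(w, prev):
    # MLE bigram probability: count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(words, model="unigram"):
    # Chain rule, truncated by the n-gram independence assumption
    p = p_unigram(words[0])
    for prev, w in zip(words, words[1:]):
        p *= p_unigram(w) if model == "unigram" else p_bigram(w, prev)
    return p

print(sentence_prob(["the", "cat", "sat"], "unigram"))
print(sentence_prob(["the", "cat", "sat"], "bigram"))
```

The bigram model assigns this sentence a higher probability than the unigram model because it credits the locally frequent transition "the cat".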
Unigram Model (1/4)
• The likelihood of a query given a document
– Words are conditionally independent of each other given the document
– How to estimate the probability of a (query) word given the document?
• Assume that words follow a multinomial distribution given the document
$$P(Q \mid M_D) = P(w_1 \mid M_D)\,P(w_2 \mid M_D) \cdots P(w_N \mid M_D) = \prod_{i=1}^{N} P(w_i \mid M_D), \qquad Q = w_1 w_2 \cdots w_N$$

Multinomial distribution over the word counts (the multinomial coefficient accounts for the permutations of the words):

$$P\big(c(w_1), \dots, c(w_{|V|}) \mid M_D\big) = \frac{\big(\sum_{i=1}^{|V|} c(w_i)\big)!}{\prod_{i=1}^{|V|} c(w_i)!}\ \prod_{i=1}^{|V|} P(w_i \mid M_D)^{c(w_i)}, \qquad \sum_{i=1}^{|V|} P(w_i \mid M_D) = 1$$

where $c(w_i)$ is the number of times word $w_i$ occurs.
Unigram Model (2/4)
• Use each document itself as a sample for estimating its corresponding unigram (multinomial) model
– If Maximum Likelihood Estimation (MLE) is adopted
$$\hat{P}(w_i \mid M_D) = \frac{c(w_i, D)}{|D|}, \qquad |D| = \sum_{i} c(w_i, D)$$

where $c(w_i, D)$ is the number of times $w_i$ occurs in $D$, and $|D|$ is the length of $D$.

(Example: a document $D$ of length 10 with counts $w_a{\times}4$, $w_b{\times}3$, $w_c{\times}2$, $w_d{\times}1$ gives $P(w_a \mid M_D)=0.4$, $P(w_b \mid M_D)=0.3$, $P(w_c \mid M_D)=0.2$, $P(w_d \mid M_D)=0.1$, $P(w_e \mid M_D)=P(w_f \mid M_D)=0.0$.)

The zero-probability problem: if $w_e$ and $w_f$ do not occur in $D$, then $P(w_e \mid M_D) = P(w_f \mid M_D) = 0$. This will cause a problem in predicting the query likelihood (see the equation for the query likelihood in the preceding slide).
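The MLE document model and the zero-probability problem can be reproduced in a few lines; the document below mirrors the slide's example counts, while the variable names are my own:

```python
from collections import Counter

# Toy document matching the slide's example counts (wa x4, wb x3, wc x2, wd x1)
doc = ["wa"] * 4 + ["wb"] * 3 + ["wc"] * 2 + ["wd"]

counts = Counter(doc)
doc_len = len(doc)

def p_mle(w):
    # MLE: c(w, D) / |D|; words unseen in D get probability 0
    return counts[w] / doc_len

print(p_mle("wa"))  # 0.4
print(p_mle("we"))  # 0.0 -> the zero-probability problem

# Any query containing a word unseen in D gets likelihood exactly 0
query = ["wa", "we"]
likelihood = 1.0
for w in query:
    likelihood *= p_mle(w)
print(likelihood)   # 0.0
```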
Unigram Model (3/4)
• Smooth the document-specific unigram model with a collection model (a mixture of two multinomials)
• The role of the collection unigram model
– Helps to solve the zero-probability problem
– Helps to differentiate the contributions of different missing terms in a document (global information, like IDF?)
• The collection unigram model can be estimated in a similar way as what we do for the document-specific unigram model
$$P(Q \mid M_D) = \prod_{i=1}^{N} \big[\, \lambda \cdot P(w_i \mid M_D) + (1-\lambda) \cdot P(w_i \mid M_C) \,\big]$$

where $P(w_i \mid M_C)$ is the collection unigram model.
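A minimal sketch of ranking by the smoothed query likelihood, assuming a mixture weight λ = 0.7 and a two-document toy collection (both are illustrative choices, not from the slides):

```python
from collections import Counter

LAMBDA = 0.7  # interpolation weight (a tuning choice, not from the slides)

docs = {
    "D1": "wa wa wa wa wb wb wb wc wc wd".split(),
    "D2": "wb wc wc we we we wf wf wf wf".split(),
}

# Collection model M_C: pool the counts of all documents
coll = Counter(w for d in docs.values() for w in d)
coll_len = sum(coll.values())

def p_smoothed(w, doc):
    # lambda * P(w|M_D) + (1 - lambda) * P(w|M_C)
    counts = Counter(doc)
    p_doc = counts[w] / len(doc)
    p_coll = coll[w] / coll_len
    return LAMBDA * p_doc + (1 - LAMBDA) * p_coll

def query_likelihood(query, doc):
    p = 1.0
    for w in query:
        p *= p_smoothed(w, doc)
    return p

query = "wa wb".split()
ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranked)
```

Note that D2 still receives a nonzero likelihood even though it never contains "wa": the collection model supplies the missing mass, which is exactly what the mixture is for.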
Unigram Model (4/4)
• An evaluation on the Topic Detection and Tracking (TDT) corpora

– Language Model

  mAP          Unigram   Unigram+Bigram
  TDT2 TQ/TD   0.6327    0.5427
  TDT2 TQ/SD   0.5658    0.4803
  TDT3 TQ/TD   0.6569    0.6141
  TDT3 TQ/SD   0.6308    0.5808

– Vector Space Model

  mAP          Unigram   Unigram+Bigram
  TDT2 TQ/TD   0.5548    0.5623
  TDT2 TQ/SD   0.5122    0.5225
  TDT3 TQ/TD   0.6505    0.6531
  TDT3 TQ/SD   0.6216    0.6233
$$P_{\text{Unigram}}(Q \mid M_D) = \prod_{i=1}^{N} \big[\, \lambda \cdot P(w_i \mid M_D) + (1-\lambda) \cdot P(w_i \mid M_C) \,\big]$$

$$P_{\text{Unigram+Bigram}}(Q \mid M_D) = \prod_{i=1}^{N} \big[\, \lambda_1 \cdot P(w_i \mid M_D) + \lambda_2 \cdot P(w_i \mid M_C) + \lambda_3 \cdot P(w_i \mid w_{i-1}, M_D) + (1-\lambda_1-\lambda_2-\lambda_3) \cdot P(w_i \mid w_{i-1}, M_C) \,\big]$$
Maximum Mutual Information
• Documents can be ranked based on their mutual information with the query

$$MI(Q, D) = \log \frac{P(Q, D)}{P(Q)\,P(D)} = \log P(Q \mid D) - \log P(Q)$$

($\log P(Q)$ is the same for all documents, and hence can be ignored)

• Document ranking by mutual information (MI) is equivalent to ranking by likelihood:

$$\arg\max_{D} MI(Q, D) = \arg\max_{D} P(Q \mid D)$$
Probabilistic Latent Semantic Analysis (PLSA)
• Also called the Aspect Model, or Probabilistic Latent Semantic Indexing (PLSI) (Thomas Hofmann, 1999)
– Graphical model representation (a kind of Bayesian network)

Reference: T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, January–February 2001
(Graphical models — language model: $D \to Q$, with $P(D)$ and $P(w \mid D)$; PLSA: $D \to T \to Q$, with $P(D)$, $P(T_k \mid D)$, and $P(w \mid T_k)$.)
Language model:
$$sim(Q, D) = P(D \mid Q) = \frac{P(Q \mid D)\,P(D)}{P(Q)} \propto P(Q \mid D)\,P(D) \approx P(Q \mid D) = \prod_{w \in Q} \big[\, \lambda \cdot P(w \mid M_D) + (1-\lambda) \cdot P(w \mid M_C) \,\big]^{c(w,Q)}$$

PLSA:
$$sim(Q, D) = P(Q \mid D) = \prod_{w \in Q} P(w \mid D)^{c(w,Q)} = \prod_{w \in Q} \Big[ \sum_{k=1}^{K} P(w \mid T_k)\,P(T_k \mid D) \Big]^{c(w,Q)}$$

The latent variables: the unobservable class variables $T_k$ (topics or domains).
PLSA: Formulation
• Definition
– $P(D)$: the probability of selecting a document $D$
– $P(T_k \mid D)$: the probability of picking a latent class $T_k$ for the document $D$
– $P(w \mid T_k)$: the probability of generating a word $w$ from the class $T_k$
IR – Berlin Chen 15
PLSA: Assumptions
• Bag-of-words: treat docs as a memoryless source; words are generated independently
• Conditional independence: the doc and the word are independent conditioned on the state of the associated latent variable
$$P(w, D \mid T_k) \approx P(w \mid T_k)\,P(D \mid T_k)$$

$$sim(Q, D) = P(Q \mid D) = \prod_{w} P(w \mid D)^{c(w,Q)}$$

$$P(w, D) = \sum_{k=1}^{K} P(w, D, T_k) = \sum_{k=1}^{K} P(w, D \mid T_k)\,P(T_k) = \sum_{k=1}^{K} P(w \mid T_k)\,P(D \mid T_k)\,P(T_k) = \sum_{k=1}^{K} P(w \mid T_k)\,P(T_k \mid D)\,P(D)$$

so that
$$P(w \mid D) = \sum_{k=1}^{K} P(w \mid T_k)\,P(T_k \mid D)$$
PLSA: Training (1/2)
• Probabilities are estimated by maximizing the collection likelihood using the Expectation-Maximization (EM) algorithm
$$L_C = \sum_{D} \sum_{w} c(w, D) \log P(w \mid D) = \sum_{D} \sum_{w} c(w, D) \log \Big[ \sum_{k=1}^{K} P(w \mid T_k)\,P(T_k \mid D) \Big]$$

EM tutorial: Jeff A. Bilmes. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models." U.C. Berkeley TR-97-021
PLSA: Training (2/2)
• E (expectation) step

$$P(T_k \mid w, D) = \frac{P(w \mid T_k)\,P(T_k \mid D)}{\sum_{k'=1}^{K} P(w \mid T_{k'})\,P(T_{k'} \mid D)}$$

• M (maximization) step

$$\hat{P}(w \mid T_k) = \frac{\sum_{D} c(w, D)\,P(T_k \mid w, D)}{\sum_{w'} \sum_{D} c(w', D)\,P(T_k \mid w', D)}$$

$$\hat{P}(T_k \mid D) = \frac{\sum_{w} c(w, D)\,P(T_k \mid w, D)}{\sum_{w'} c(w', D)}$$
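The E- and M-steps above can be sketched directly in code. This is an illustrative implementation under a toy corpus of my own; the document names, words, and topic count are assumptions, not from the slides:

```python
import random

# Toy term counts c(w, D) for three documents; K = 2 latent topics
docs = {
    "D1": {"apple": 4, "fruit": 3, "pie": 2},
    "D2": {"apple": 1, "stock": 4, "market": 3},
    "D3": {"stock": 2, "market": 4, "fruit": 1},
}
vocab = sorted({w for c in docs.values() for w in c})
K = 2
rng = random.Random(0)

# Random initialisation of P(w|Tk), normalised per topic; P(Tk|D) uniform
p_w_t = [{w: rng.random() for w in vocab} for _ in range(K)]
for t in p_w_t:
    z = sum(t.values())
    for w in t:
        t[w] /= z
p_t_d = {d: [1.0 / K] * K for d in docs}

for _ in range(50):
    # E-step: P(Tk | w, D) proportional to P(w|Tk) * P(Tk|D)
    post = {}
    for d, cnts in docs.items():
        for w in cnts:
            num = [p_w_t[k][w] * p_t_d[d][k] for k in range(K)]
            z = sum(num)
            post[d, w] = [x / z for x in num]
    # M-step: re-estimate P(w|Tk) and P(Tk|D) from posterior-weighted counts
    for k in range(K):
        tot = {w: 0.0 for w in vocab}
        for d, cnts in docs.items():
            for w, c in cnts.items():
                tot[w] += c * post[d, w][k]
        z = sum(tot.values())
        p_w_t[k] = {w: v / z for w, v in tot.items()}
    for d, cnts in docs.items():
        n = sum(cnts.values())
        p_t_d[d] = [sum(c * post[d, w][k] for w, c in cnts.items()) / n
                    for k in range(K)]

def p_w_given_d(w, d):
    # P(w|D) = sum_k P(w|Tk) * P(Tk|D)
    return sum(p_w_t[k][w] * p_t_d[d][k] for k in range(K))

print({w: round(p_w_given_d(w, "D1"), 3) for w in vocab})
```

After training, each `p_w_t[k]` and each `p_t_d[d]` remains a proper distribution, so the smoothed `p_w_given_d` sums to one over the vocabulary.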
PLSA: Latent Probability Space (1/2)
(Figure: example latent topics — "image sequence analysis," "medical imaging," "context of contour / boundary detection," "phonetic segmentation.")

$$P(w_j, D_i) = \sum_{T_k} P(w_j, D_i, T_k) = \sum_{T_k} P(w_j \mid T_k)\,P(T_k \mid D_i)\,P(D_i) = \sum_{T_k} P(w_j \mid T_k)\,P(T_k)\,P(D_i \mid T_k)$$

In matrix form, $P(W, D) = U \Sigma V^{T}$, with
$$U := \big[P(w_j \mid T_k)\big]_{j,k}, \qquad \Sigma := \mathrm{diag}\big(P(T_k)\big), \qquad V := \big[P(D_i \mid T_k)\big]_{i,k}$$

Dimensionality: $K = 128$ (latent classes)
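A quick numerical check of the factored form $P = U \Sigma V^{T}$: with properly normalised factors, the reconstructed joint $P(w_j, D_i)$ sums to one over all word–document pairs. The sizes and names below are illustrative only:

```python
import random

rng = random.Random(0)
m, n, K = 5, 4, 2  # words, documents, latent topics (toy sizes)

def normalise(xs):
    z = sum(xs)
    return [x / z for x in xs]

# Columns of U hold P(w_j|Tk); sigma holds P(Tk); columns of V hold P(D_i|Tk)
U = [normalise([rng.random() for _ in range(m)]) for _ in range(K)]  # U[k][j]
V = [normalise([rng.random() for _ in range(n)]) for _ in range(K)]  # V[k][i]
sigma = normalise([rng.random() for _ in range(K)])

# P(w_j, D_i) = sum_k P(w_j|Tk) * P(Tk) * P(D_i|Tk)   (i.e. P = U diag(sigma) V^T)
P = [[sum(U[k][j] * sigma[k] * V[k][i] for k in range(K))
      for i in range(n)] for j in range(m)]

total = sum(sum(row) for row in P)
print(total)  # ~1.0 -- a proper joint distribution over (word, document)
```

This is the key contrast with LSA: the three factors are nonnegative probability distributions, so the product is itself a joint distribution rather than an arbitrary low-rank approximation.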
PLSA: Latent Probability Space (2/2)
(Figure: the $m \times n$ joint probability matrix $P = \big[P(w_j, D_i)\big]$, with rows $w_1, \dots, w_m$ and columns $D_1, \dots, D_n$, factored as $P = U_{m \times K}\, \Sigma_{K \times K}\, V^{T}_{K \times n}$, where $U$ holds $P(w_j \mid T_k)$, $\Sigma = \mathrm{diag}(P(T_k))$, and $V$ holds $P(D_i \mid T_k)$, for topics $t_1, \dots, t_K$.)

$$P(w_j, D_i) = \sum_{T_k} P(w_j \mid T_k)\,P(T_k)\,P(D_i \mid T_k)$$
PLSA: One more example on TDT1 dataset
(Figure: example TDT1 topics, with top words relating to "aviation / space missions," "family love," and "Hollywood love.")
PLSA: Experiment Results (1/4)
• Experimental results
– Two ways to smooth the empirical distribution with PLSA
  • PLSA-U*: combine the cosine score with that of the vector space model, as is done for LSA (see next slide)
  • PLSA-Q*: combine the multinomials individually
– Both provide almost identical performance
– It's not known if PLSA was used alone

$$P_{\text{PLSA-Q*}}(w \mid D) = \lambda \cdot P_{\text{Empirical}}(w \mid D) + (1-\lambda) \cdot P_{\text{PLSA}}(w \mid D)$$

$$P_{\text{Empirical}}(w \mid D) = \frac{c(w, D)}{c(D)}, \qquad P_{\text{PLSA}}(w \mid D) = \sum_{k=1}^{K} P(w \mid T_k)\,P(T_k \mid D)$$

$$P_{\text{PLSA-Q*}}(Q \mid D) = \prod_{w \in Q} \big[\, \lambda \cdot P_{\text{Empirical}}(w \mid D) + (1-\lambda) \cdot P_{\text{PLSA}}(w \mid D) \,\big]^{c(w,Q)}$$
PLSA: Experiment Results (2/4)
PLSA-U*
• Use the low-dimensional representations $P(T_k \mid Q)$ and $P(T_k \mid D)$ (viewed in a $K$-dimensional latent space) to evaluate relevance by means of the cosine measure

$$R_{\text{PLSA-U*}}(Q, D) = \frac{\sum_{k} P(T_k \mid Q)\,P(T_k \mid D)}{\sqrt{\sum_{k} P(T_k \mid Q)^2}\ \sqrt{\sum_{k} P(T_k \mid D)^2}}$$

where $P(T_k \mid Q)$ is obtained by folding the query in online:

$$P(T_k \mid Q) = \frac{\sum_{w \in Q} c(w, Q)\,P(T_k \mid w, Q)}{\sum_{w' \in Q} c(w', Q)}$$

• Combine the cosine score with that of the vector space model, using an ad hoc weight $\lambda$ to re-weight the two model components (dimensions):

$$\tilde{R}_{\text{PLSA-U*}}(Q, D) = \lambda \cdot R_{\text{PLSA-U*}}(Q, D) + (1-\lambda) \cdot R_{\text{VSM}}(Q, D)$$
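The PLSA-U* cosine in the latent topic space can be sketched as follows; the folded-in topic posteriors below are hypothetical numbers chosen purely for illustration:

```python
from math import sqrt

def latent_cosine(p_t_q, p_t_d):
    # R(Q, D) = sum_k P(Tk|Q) P(Tk|D) / (||P(T|Q)|| * ||P(T|D)||)
    num = sum(q * d for q, d in zip(p_t_q, p_t_d))
    return num / (sqrt(sum(q * q for q in p_t_q)) *
                  sqrt(sum(d * d for d in p_t_d)))

# Hypothetical folded-in posteriors for a query and two documents (K = 3)
p_t_q  = [0.8, 0.1, 0.1]
p_t_d1 = [0.7, 0.2, 0.1]   # topically close to the query
p_t_d2 = [0.1, 0.1, 0.8]   # topically distant

print(latent_cosine(p_t_q, p_t_d1) > latent_cosine(p_t_q, p_t_d2))  # True
```

Documents whose topic mixture points in the same direction as the query's score near 1, regardless of document length, which is precisely why the cosine is taken in the latent space rather than over raw term vectors.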
PLSA: Experiment Results (3/4)
• Why?
– Recall that in LSA, the relations between any two docs can be formulated as

$$A'^{T} A' = (U' \Sigma' V'^{T})^{T} (U' \Sigma' V'^{T}) = V' \Sigma'^{T} U'^{T} U' \Sigma' V'^{T} = (V' \Sigma')(V' \Sigma')^{T}$$

– PLSA mimics LSA in its similarity measure. For two documents $D_i$ and $D_s$, using $P(T_k \mid D_i)\,P(D_i) = P(D_i \mid T_k)\,P(T_k)$ (the $P(D_i)$ and $P(D_s)$ factors cancel in the cosine):

$$R_{\text{PLSA-U*}}(D_i, D_s) = \frac{\sum_{k} P(T_k \mid D_i)\,P(T_k \mid D_s)}{\sqrt{\sum_{k} P(T_k \mid D_i)^2}\ \sqrt{\sum_{k} P(T_k \mid D_s)^2}} = \frac{\sum_{k} \big[P(D_i \mid T_k)\,P(T_k)\big]\big[P(D_s \mid T_k)\,P(T_k)\big]}{\sqrt{\sum_{k} \big[P(D_i \mid T_k)\,P(T_k)\big]^2}\ \sqrt{\sum_{k} \big[P(D_s \mid T_k)\,P(T_k)\big]^2}}$$

– This is analogous to the LSA similarity between documents represented as row vectors $\hat{D}_i$ and $\hat{D}_s$:

$$sim(\hat{D}_i, \hat{D}_s) = \mathrm{cosine}(\hat{D}_i \hat{\Sigma},\ \hat{D}_s \hat{\Sigma}) = \frac{\hat{D}_i \hat{\Sigma}\, \hat{\Sigma}^{T} \hat{D}_s^{T}}{\lVert \hat{D}_i \hat{\Sigma} \rVert\ \lVert \hat{D}_s \hat{\Sigma} \rVert}$$
PLSA: Experiment Results (4/4)
PLSA vs. LSA
• Decomposition/Approximation
– LSA: least-squares criterion measured on the L2 (Frobenius) norm of the word–doc matrix
– PLSA: maximization of the likelihood function, based on the cross entropy (Kullback–Leibler divergence) between the empirical distribution and the model
• Computational complexity
– LSA: SVD decomposition
– PLSA: EM training, which can be time-consuming over many iterations