Introduction to Information Retrieval - Stanford...

5/21/17

1

IntroductiontoInformationRetrieval

Introductionto

InformationRetrieval

ProbabilisticInformationRetrievalChristopherManning andPanduNayak


FROMBOOLEANTORANKEDRETRIEVAL…INTWOSTEPS


Rankedretrieval§ Thusfar,ourquerieshaveallbeenBoolean

§ Documentseithermatchordon’t§ Canbegoodforexpertuserswithpreciseunderstandingoftheirneedsandthecollection§ Canalsobegoodforapplications:Applicationscaneasilyconsume1000sofresults

§ Notgoodforthemajorityofusers§ MostusersincapableofwritingBooleanqueries

§ Ortheyare,buttheythinkit’stoomuchwork

§ Mostusersdon’twanttowadethrough1000sofresults.§ Thisisparticularlytrueofwebsearch

Ch. 6 IntroductiontoInformationRetrieval

ProblemwithBooleansearch:feastorfamine§ Booleanqueriesoftenresultineithertoofew(=0)ortoomany(1000s)results.

§ Query1:“standarduserdlink 650”→200,000hits§ Query2:“standarduserdlink 650nocardfound”:0hits

§ Ittakesalotofskilltocomeupwithaquerythatproducesamanageablenumberofhits.§ ANDgivestoofew;ORgivestoomany

Ch. 6


Who are these people?

Stephen Robertson Keith van RijsbergenKaren Spärck Jones


WhyprobabilitiesinIR?

User Information Need

DocumentsDocument

Representation

QueryRepresentation

How to match?

In traditional IR systems, matching between each document andquery is attempted in a semantically imprecise space of index terms.

Probabilities provide a principled foundation for uncertain reasoning.Can we use probabilities to quantify our uncertainties?

Uncertain guess ofwhether document has relevant content

Understandingof user need isuncertain

5/21/17

2


ProbabilisticIRtopics

1. Classicalprobabilisticretrievalmodel§ Probabilityrankingprinciple,etc.§ Binaryindependencemodel(≈NaïveBayestextcat)§ (Okapi)BM25

2. Bayesiannetworksfortextretrieval3. LanguagemodelapproachtoIR

§ Animportantemphasisinrecentwork

ProbabilisticmethodsareoneoftheoldestbutalsooneofthecurrentlyhottopicsinIR

§ Traditionally:neatideas,butdidn’twinonperformance§ Itseemstobedifferentnow


Thedocumentrankingproblem§ Wehaveacollectionofdocuments§ Userissuesaquery§ Alistofdocumentsneedstobereturned§ RankingmethodisthecoreofmodernIRsystems:

§ Inwhatorderdowepresentdocumentstotheuser?§ Wewantthe“best”documenttobefirst,secondbestsecond,etc….

§ Idea:Rankbyprobabilityofrelevanceofthedocumentw.r.t.informationneed§ P(R=1|documenti,query)


§ ForeventsAandB:

§ Bayes’ Rule

§ Odds:

Prior

p(A,B) = p(A∩B) = p(A | B)p(B) = p(B | A)p(A)

p(A | B) = p(B | A)p(A)p(B)

=p(B | A)p(A)

p(B | X)p(X)X=A,A∑

Recallafewprobabilitybasics

O(A) = p(A)p(A)

=p(A)

1− p(A)

Posterior


“Ifareferenceretrievalsystem’sresponsetoeachrequestisarankingofthedocumentsinthecollectioninorderofdecreasingprobabilityofrelevancetotheuserwhosubmittedtherequest,wheretheprobabilitiesareestimatedasaccuratelyaspossibleonthebasisofwhateverdatahavebeenmadeavailabletothesystemforthispurpose,theoveralleffectivenessofthesystemtoitsuserwillbethebestthatisobtainableonthebasisofthosedata.”

§ [1960s/1970s]S.Robertson,W.S.Cooper,M.E.Maron;van Rijsbergen (1979:113);Manning&Schütze (1999:538)

TheProbabilityRankingPrinciple


ProbabilityRankingPrinciple

Letx representadocumentinthecollection.LetR representrelevanceofadocumentw.r.t.given(fixed)queryandletR=1 representrelevantandR=0 notrelevant.

p(R =1| x) = p(x | R =1)p(R =1)p(x)

p(R = 0 | x) = p(x | R = 0)p(R = 0)p(x)

p(x|R=1), p(x|R=0) - probability that if a relevant (not relevant) document is retrieved, it is x.

Needtofindp(R=1|x) – probabilitythatadocumentx isrelevant.

p(R=1),p(R=0) - prior probabilityof retrieving a relevant or non-relevantdocument

p(R = 0 | x)+ p(R =1| x) =1


ProbabilityRankingPrinciple(PRP)§ Simplecase:noselectioncostsorotherutilityconcernsthatwoulddifferentiallyweighterrors

§ PRPinaction:Rankalldocumentsbyp(R=1|x)

§ Theorem:UsingthePRPisoptimal,inthatitminimizestheloss(Bayesrisk)under1/0loss§ Provableifallprobabilitiescorrect,etc.[e.g.,Ripley1996]

5/21/17

3


ProbabilityRankingPrinciple

§ Morecomplexcase:retrievalcosts.§ Letd beadocument§ C– costofnotretrievingarelevant document§ C’ – costofretrievinganon-relevant document

§ ProbabilityRankingPrinciple:if

foralld’notyetretrieved,thend isthenextdocumenttoberetrieved

§ Wewon’tfurtherconsidercost/utilityfromnowon

!C ⋅ p(R = 0 | d)−C ⋅ p(R =1| d) ≤ !C ⋅ p(R = 0 | !d )−C ⋅ p(R =1| !d )


ProbabilityRankingPrinciple§ Howdowecomputeallthoseprobabilities?

§ Donotknowexactprobabilities,havetouseestimates§ BinaryIndependenceModel(BIM)– whichwediscussnext– isthesimplestmodel

§ Questionableassumptions§ “Relevance”ofeachdocumentisindependentofrelevanceofotherdocuments.§ Really,it’sbadtokeeponreturningduplicates

§ Booleanmodelofrelevance§ Thatonehasasinglestepinformationneed

§ Seeingarangeofresultsmightletuserrefinequery


ProbabilisticRetrievalStrategy

§ Estimatehowtermscontributetorelevance§ Howdootherthingsliketermfrequencyanddocumentlengthinfluenceyourjudgmentsaboutdocumentrelevance?§ NotatallinBIM§ AmorenuancedansweristheOkapi(BM25)formulae[nexttime]

§ Spärck Jones/Robertson

§ Combinetofinddocumentrelevanceprobability

§ Orderdocumentsbydecreasingprobability


ProbabilisticRanking

Basic concept:“For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically.”

Van Rijsbergen


BinaryIndependenceModel§ TraditionallyusedinconjunctionwithPRP§ “Binary”=Boolean:documentsarerepresentedasbinary

incidencevectorsofterms(cf.IIRChapter1):§

§ iff termi ispresentindocumentx.§ “Independence”: termsoccurindocumentsindependently§ Differentdocumentscanbemodeledasthesamevector

),,( 1 nxxx !"=1=ix


BinaryIndependenceModel§ Queries:binarytermincidencevectors§ Givenqueryq,

§ foreachdocumentd needtocomputep(R|q,d).§ replacewithcomputingp(R|q,x) where x isbinarytermincidencevectorrepresentingd.

§ Interestedonlyinranking§ WilluseoddsandBayes’Rule:

O(R | q, x) = p(R =1| q, x)p(R = 0 | q, x)

=

p(R =1| q)p(x | R =1,q)p(x | q)

p(R = 0 | q)p(x | R = 0,q)p(x | q)

5/21/17

4


BinaryIndependenceModel

• Using Independence Assumption:

O(R | q, x) =O(R | q) ⋅ p(xi | R =1,q)p(xi | R = 0,q)i=1

n

∏

p(x | R =1,q)p(x | R = 0,q)

=p(xi | R =1,q)p(xi | R = 0,q)i=1

n

∏

O(R | q, x) = p(R =1| q, x)p(R = 0 | q, x)

=p(R =1| q)p(R = 0 | q)

⋅p(x | R =1,q)p(x | R = 0,q)

Constant for a given query Needs estimation



• Since xi is either 0 or 1:

O(R | q, x) =O(R | q) ⋅ p(xi =1| R =1,q)p(xi =1| R = 0,q)xi=1

∏ ⋅p(xi = 0 | R =1,q)p(xi = 0 | R = 0,q)xi=0

∏

• Let pi = p(xi =1| R =1,q); ri = p(xi =1| R = 0,q);

• Assume, for all terms not occurring in the query (qi=0) ii rp =

O(R | q, x) =O(R | q) ⋅ p(xi | R =1,q)p(xi | R = 0,q)i=1

n

∏

O(R | q, x) =O(R | q) ⋅ pirixi=1

qi=1

∏ ⋅(1− pi )(1− ri )xi=0

qi=1

∏


document relevant(R=1) notrelevant(R=0)termpresent xi =1 pi ritermabsent xi =0 (1– pi) (1 – ri)


All matching terms Non-matching query terms


All matching termsAll query terms

O(R | q, x) =O(R | q) ⋅ pirixi=1

qi=1

∏ ⋅1− ri1− pi

⋅1− pi1− ri

$

%&

'

()

xi=1qi=1

∏ 1− pi1− rixi=0

qi=1

∏

O(R | q, x) =O(R | q) ⋅ pi (1− ri )ri (1− pi )xi=qi=1

∏ ⋅1− pi1− riqi=1

∏

O(R | q, x) =O(R | q) ⋅ pirixi=qi=1

∏ ⋅1− pi1− rixi=0

qi=1

∏



Constant foreach query

Only quantity to be estimated for rankings

ÕÕ=== -

-×

--

×=11 11

)1()1()|(),|(

iii q i

i

qx ii

ii

rp

prrpqROxqRO !

RetrievalStatusValue:

åÕ==== -

-=

--

=11 )1(

)1(log)1()1(log

iiii qx ii

ii

qx ii

ii

prrp

prrpRSV



AllboilsdowntocomputingRSV.

åÕ==== -

-=

--

=11 )1(

)1(log)1()1(log

iiii qx ii

ii

qx ii

ii

prrp

prrpRSV

å==

=1;

ii qxicRSV

)1()1(logii

iii pr

rpc--

=

So,howdowecomputeci’s fromourdata?

Theci arelogoddsratiosTheyfunctionasthetermweightsinthismodel

5/21/17

5


BinaryIndependenceModel• EstimatingRSVcoefficients intheory• Foreachtermi lookatthistableofdocumentcounts:

Documents

Relevant Non-Relevant Total

xi=1 s n-s n xi=0 S-s N-n-S+s N-n Total S N-S N

Sspi » )(

)(SNsnri -

-»

)()()(log),,,(

sSnNsnsSssSnNKci +---

-=»

• Estimates: For now,assume nozero terms.Remembersmoothing.

)1()1(logii

iii pr

rpc--

=


Estimation– keychallenge

§ Ifnon-relevantdocumentsareapproximatedbythewholecollection,thenri (prob.ofoccurrenceinnon-relevantdocumentsforquery)isn/Nand

log1− riri

= log N − n− S + sn− s

≈ log N − nn

≈ log Nn= IDF!


Collectionvs.Documentfrequency

§ Collectionfrequencyoft isthetotal numberofoccurrencesoft inthecollection(incl.multiples)

§ Documentfrequencyisnumberofdocstisin§ Example:

§ Whichwordisabettersearchterm(andshouldgetahigherweight)?

Word Collection frequency Document frequency

insurance 10440 3997

try 10422 8760

Sec. 6.2.1 IntroductiontoInformationRetrieval

Estimation– keychallenge

§ pi (probabilityofoccurrenceinrelevantdocuments)cannotbeapproximatedaseasily

§ pi canbeestimatedinvariousways:§ fromrelevantdocumentsifyouknowsome

§ Relevanceweightingcanbeusedinafeedbackloop

§ constant(CroftandHarpercombinationmatch)– thenjustgetidf weightingofterms(withpi=0.5)

§ proportionaltoprob.ofoccurrenceincollection§ Greiff (SIGIR1998)arguesfor1/3+2/3dfi/N

RSV = log Nnixi=qi=1

∑


ProbabilisticRelevanceFeedback1. GuessapreliminaryprobabilisticdescriptionofR=1

documents;useittoretrieveasetofdocuments2. Interactwiththeusertorefinethedescription:

learnsomedefinitememberswithR=1andR=03. Re-estimatepi andri onthebasisofthese

§ Ifi appearsinVi withinsetofdocumentsV:pi =|Vi|/|V|§ Orcancombinenewinformationwithoriginalguess(use

Bayesianprior):

4. Repeat,thusgeneratingasuccessionofapproximationstorelevantdocuments

kk++

=|||| )1(

)2(

VpVp ii

iκ is prior

weight


30

Iterativelyestimatingpi and ri(=Pseudo-relevancefeedback)1. Assumethatpi isconstantoverallxi inqueryandri

asbefore§ pi =0.5(evenodds)foranygivendoc

2. Determineguessofrelevantdocumentset:§ V isfixedsizesetofhighestrankeddocumentsonthis

model3. Weneedtoimproveourguessesforpi andri,so

§ Usedistributionofxi indocsinV.LetVi besetofdocumentscontainingxi§ pi =|Vi|/|V|

§ Assumeifnotretrievedthennotrelevant§ ri =(ni – |Vi|)/(N– |V|)

4. Goto2.untilconvergesthenreturnranking

5/21/17

6


PRPandBIM

§ Gettingreasonableapproximationsofprobabilitiesispossible.

§ Requiresrestrictiveassumptions:§ Termindependence§ Termsnotinquerydon’taffecttheoutcome§ Booleanrepresentationofdocuments/queries/relevance

§ Documentrelevancevaluesareindependent§ Someoftheseassumptionscanberemoved§ Problem:eitherrequirepartialrelevanceinformationor

seeminglyonlycanderivesomewhatinferiortermweights


Removingtermindependence§ Ingeneral,indextermsaren’t

independent§ Dependenciescanbecomplex§ vanRijsbergen (1979)proposed

simplemodelofdependenciesasatree§ ExactlyFriedmanand

Goldszmidt’s TreeAugmentedNaiveBayes(AAAI13,1996)

§ Eachtermdependentononeother

§ In1970s,estimationproblemsheldbacksuccessofthismodel


Secondstep:Termfrequency§ Rightinthefirstlecture,wesaidthatapageshouldrankhigherifitmentionsawordmore§ Perhapsmodulatedbythingslikepagelength

§ Wemightwantamodelwithtermfrequencyinit.

§ We’llseeaprobabilisticonenexttime– BM25

§ Quicksummaryofvectorspacemodel


Summary– vectorspaceranking

§ Representthequeryasaweightedtermfrequency/inversedocumentfrequency(tf-idf)vector

§ Representeachdocumentasaweightedtf-idf vector§ Computethecosinesimilarityscoreforthequeryvectorandeachdocumentvector

§ Rankdocumentswithrespecttothequerybyscore§ ReturnthetopK (e.g.,K =10)totheuser

IntroductiontoInformationRetrieval IntroductiontoInformationRetrieval

Cosinesimilarity

Sec. 6.3

5/21/17

7


tf-idfweightinghasmanyvariants

Sec. 6.4 IntroductiontoInformationRetrieval

ResourcesS.E.RobertsonandK.Spärck Jones.1976.RelevanceWeightingofSearch

Terms.JournaloftheAmericanSocietyforInformationSciences27(3):129–146.

C.J.vanRijsbergen.1979.InformationRetrieval. 2nded.London:Butterworths,chapter6.[Mostdetailsofmath]http://www.dcs.gla.ac.uk/Keith/Preface.html

N.Fuhr.1992.ProbabilisticModelsinInformationRetrieval.TheComputerJournal,35(3),243–255.[Easiestread,withBNs]

F.Crestani,M.Lalmas,C.J.vanRijsbergen,andI.Campbell.1998.IsThisDocumentRelevant?…Probably:ASurveyofProbabilisticModelsinInformationRetrieval.ACMComputingSurveys 30(4):528–552.http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/[Addsverylittlematerialthatisn’tinvanRijsbergen orFuhr ]

Introduction to Information Retrieval - Stanford...

Documents

Transcript of Introduction to Information Retrieval - Stanford...