Introduction to Information Retrieval - Stanford...
Transcript of Introduction to Information Retrieval - Stanford...
5/21/17
1
IntroductiontoInformationRetrieval
Introductionto
InformationRetrieval
ProbabilisticInformationRetrievalChristopherManning andPanduNayak
IntroductiontoInformationRetrieval
FROMBOOLEANTORANKEDRETRIEVAL…INTWOSTEPS
IntroductiontoInformationRetrieval
Rankedretrieval§ Thusfar,ourquerieshaveallbeenBoolean
§ Documentseithermatchordon’t§ Canbegoodforexpertuserswithpreciseunderstandingoftheirneedsandthecollection§ Canalsobegoodforapplications:Applicationscaneasilyconsume1000sofresults
§ Notgoodforthemajorityofusers§ MostusersincapableofwritingBooleanqueries
§ Ortheyare,buttheythinkit’stoomuchwork
§ Mostusersdon’twanttowadethrough1000sofresults.§ Thisisparticularlytrueofwebsearch
Ch. 6 IntroductiontoInformationRetrieval
ProblemwithBooleansearch:feastorfamine§ Booleanqueriesoftenresultineithertoofew(=0)ortoomany(1000s)results.
§ Query1:“standarduserdlink 650”→200,000hits§ Query2:“standarduserdlink 650nocardfound”:0hits
§ Ittakesalotofskilltocomeupwithaquerythatproducesamanageablenumberofhits.§ ANDgivestoofew;ORgivestoomany
Ch. 6
IntroductiontoInformationRetrieval
Who are these people?
Stephen Robertson Keith van RijsbergenKaren Spärck Jones
IntroductiontoInformationRetrieval
WhyprobabilitiesinIR?
User Information Need
DocumentsDocument
Representation
QueryRepresentation
How to match?
In traditional IR systems, matching between each document andquery is attempted in a semantically imprecise space of index terms.
Probabilities provide a principled foundation for uncertain reasoning.Can we use probabilities to quantify our uncertainties?
Uncertain guess ofwhether document has relevant content
Understandingof user need isuncertain
5/21/17
2
IntroductiontoInformationRetrieval
ProbabilisticIRtopics
1. Classicalprobabilisticretrievalmodel§ Probabilityrankingprinciple,etc.§ Binaryindependencemodel(≈NaïveBayestextcat)§ (Okapi)BM25
2. Bayesiannetworksfortextretrieval3. LanguagemodelapproachtoIR
§ Animportantemphasisinrecentwork
ProbabilisticmethodsareoneoftheoldestbutalsooneofthecurrentlyhottopicsinIR
§ Traditionally:neatideas,butdidn’twinonperformance§ Itseemstobedifferentnow
IntroductiontoInformationRetrieval
Thedocumentrankingproblem§ Wehaveacollectionofdocuments§ Userissuesaquery§ Alistofdocumentsneedstobereturned§ RankingmethodisthecoreofmodernIRsystems:
§ Inwhatorderdowepresentdocumentstotheuser?§ Wewantthe“best”documenttobefirst,secondbestsecond,etc….
§ Idea:Rankbyprobabilityofrelevanceofthedocumentw.r.t.informationneed§ P(R=1|documenti,query)
IntroductiontoInformationRetrieval
§ ForeventsAandB:
§ Bayes’ Rule
§ Odds:
Prior
p(A,B) = p(A∩B) = p(A | B)p(B) = p(B | A)p(A)
p(A | B) = p(B | A)p(A)p(B)
=p(B | A)p(A)
p(B | X)p(X)X=A,A∑
Recallafewprobabilitybasics
O(A) = p(A)p(A)
=p(A)
1− p(A)
Posterior
IntroductiontoInformationRetrieval
“Ifareferenceretrievalsystem’sresponsetoeachrequestisarankingofthedocumentsinthecollectioninorderofdecreasingprobabilityofrelevancetotheuserwhosubmittedtherequest,wheretheprobabilitiesareestimatedasaccuratelyaspossibleonthebasisofwhateverdatahavebeenmadeavailabletothesystemforthispurpose,theoveralleffectivenessofthesystemtoitsuserwillbethebestthatisobtainableonthebasisofthosedata.”
§ [1960s/1970s]S.Robertson,W.S.Cooper,M.E.Maron;van Rijsbergen (1979:113);Manning&Schütze (1999:538)
TheProbabilityRankingPrinciple
IntroductiontoInformationRetrieval
ProbabilityRankingPrinciple
Letx representadocumentinthecollection.LetR representrelevanceofadocumentw.r.t.given(fixed)queryandletR=1 representrelevantandR=0 notrelevant.
p(R =1| x) = p(x | R =1)p(R =1)p(x)
p(R = 0 | x) = p(x | R = 0)p(R = 0)p(x)
p(x|R=1), p(x|R=0) - probability that if a relevant (not relevant) document is retrieved, it is x.
Needtofindp(R=1|x) – probabilitythatadocumentx isrelevant.
p(R=1),p(R=0) - prior probabilityof retrieving a relevant or non-relevantdocument
p(R = 0 | x)+ p(R =1| x) =1
IntroductiontoInformationRetrieval
ProbabilityRankingPrinciple(PRP)§ Simplecase:noselectioncostsorotherutilityconcernsthatwoulddifferentiallyweighterrors
§ PRPinaction:Rankalldocumentsbyp(R=1|x)
§ Theorem:UsingthePRPisoptimal,inthatitminimizestheloss(Bayesrisk)under1/0loss§ Provableifallprobabilitiescorrect,etc.[e.g.,Ripley1996]
5/21/17
3
IntroductiontoInformationRetrieval
ProbabilityRankingPrinciple
§ Morecomplexcase:retrievalcosts.§ Letd beadocument§ C– costofnotretrievingarelevant document§ C’ – costofretrievinganon-relevant document
§ ProbabilityRankingPrinciple:if
foralld’notyetretrieved,thend isthenextdocumenttoberetrieved
§ Wewon’tfurtherconsidercost/utilityfromnowon
!C ⋅ p(R = 0 | d)−C ⋅ p(R =1| d) ≤ !C ⋅ p(R = 0 | !d )−C ⋅ p(R =1| !d )
IntroductiontoInformationRetrieval
ProbabilityRankingPrinciple§ Howdowecomputeallthoseprobabilities?
§ Donotknowexactprobabilities,havetouseestimates§ BinaryIndependenceModel(BIM)– whichwediscussnext– isthesimplestmodel
§ Questionableassumptions§ “Relevance”ofeachdocumentisindependentofrelevanceofotherdocuments.§ Really,it’sbadtokeeponreturningduplicates
§ Booleanmodelofrelevance§ Thatonehasasinglestepinformationneed
§ Seeingarangeofresultsmightletuserrefinequery
IntroductiontoInformationRetrieval
ProbabilisticRetrievalStrategy
§ Estimatehowtermscontributetorelevance§ Howdootherthingsliketermfrequencyanddocumentlengthinfluenceyourjudgmentsaboutdocumentrelevance?§ NotatallinBIM§ AmorenuancedansweristheOkapi(BM25)formulae[nexttime]
§ Spärck Jones/Robertson
§ Combinetofinddocumentrelevanceprobability
§ Orderdocumentsbydecreasingprobability
IntroductiontoInformationRetrieval
ProbabilisticRanking
Basic concept:“For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.
By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically.”
Van Rijsbergen
IntroductiontoInformationRetrieval
BinaryIndependenceModel§ TraditionallyusedinconjunctionwithPRP§ “Binary”=Boolean:documentsarerepresentedasbinary
incidencevectorsofterms(cf.IIRChapter1):§
§ iff termi ispresentindocumentx.§ “Independence”: termsoccurindocumentsindependently§ Differentdocumentscanbemodeledasthesamevector
),,( 1 nxxx !"=1=ix
IntroductiontoInformationRetrieval
BinaryIndependenceModel§ Queries:binarytermincidencevectors§ Givenqueryq,
§ foreachdocumentd needtocomputep(R|q,d).§ replacewithcomputingp(R|q,x) where x isbinarytermincidencevectorrepresentingd.
§ Interestedonlyinranking§ WilluseoddsandBayes’Rule:
O(R | q, x) = p(R =1| q, x)p(R = 0 | q, x)
=
p(R =1| q)p(x | R =1,q)p(x | q)
p(R = 0 | q)p(x | R = 0,q)p(x | q)
5/21/17
4
IntroductiontoInformationRetrieval
BinaryIndependenceModel
• Using Independence Assumption:
O(R | q, x) =O(R | q) ⋅ p(xi | R =1,q)p(xi | R = 0,q)i=1
n
∏
p(x | R =1,q)p(x | R = 0,q)
=p(xi | R =1,q)p(xi | R = 0,q)i=1
n
∏
O(R | q, x) = p(R =1| q, x)p(R = 0 | q, x)
=p(R =1| q)p(R = 0 | q)
⋅p(x | R =1,q)p(x | R = 0,q)
Constant for a given query Needs estimation
IntroductiontoInformationRetrieval
BinaryIndependenceModel
• Since xi is either 0 or 1:
O(R | q, x) =O(R | q) ⋅ p(xi =1| R =1,q)p(xi =1| R = 0,q)xi=1
∏ ⋅p(xi = 0 | R =1,q)p(xi = 0 | R = 0,q)xi=0
∏
• Let pi = p(xi =1| R =1,q); ri = p(xi =1| R = 0,q);
• Assume, for all terms not occurring in the query (qi=0) ii rp =
O(R | q, x) =O(R | q) ⋅ p(xi | R =1,q)p(xi | R = 0,q)i=1
n
∏
O(R | q, x) =O(R | q) ⋅ pirixi=1
qi=1
∏ ⋅(1− pi )(1− ri )xi=0
qi=1
∏
IntroductiontoInformationRetrieval
document relevant(R=1) notrelevant(R=0)termpresent xi =1 pi ritermabsent xi =0 (1– pi) (1 – ri)
IntroductiontoInformationRetrieval
All matching terms Non-matching query terms
BinaryIndependenceModel
All matching termsAll query terms
O(R | q, x) =O(R | q) ⋅ pirixi=1
qi=1
∏ ⋅1− ri1− pi
⋅1− pi1− ri
$
%&
'
()
xi=1qi=1
∏ 1− pi1− rixi=0
qi=1
∏
O(R | q, x) =O(R | q) ⋅ pi (1− ri )ri (1− pi )xi=qi=1
∏ ⋅1− pi1− riqi=1
∏
O(R | q, x) =O(R | q) ⋅ pirixi=qi=1
∏ ⋅1− pi1− rixi=0
qi=1
∏
IntroductiontoInformationRetrieval
BinaryIndependenceModel
Constant foreach query
Only quantity to be estimated for rankings
ÕÕ=== -
-×
--
×=11 11
)1()1()|(),|(
iii q i
i
qx ii
ii
rp
prrpqROxqRO !
RetrievalStatusValue:
åÕ==== -
-=
--
=11 )1(
)1(log)1()1(log
iiii qx ii
ii
qx ii
ii
prrp
prrpRSV
IntroductiontoInformationRetrieval
BinaryIndependenceModel
AllboilsdowntocomputingRSV.
åÕ==== -
-=
--
=11 )1(
)1(log)1()1(log
iiii qx ii
ii
qx ii
ii
prrp
prrpRSV
å==
=1;
ii qxicRSV
)1()1(logii
iii pr
rpc--
=
So,howdowecomputeci’s fromourdata?
Theci arelogoddsratiosTheyfunctionasthetermweightsinthismodel
5/21/17
5
IntroductiontoInformationRetrieval
BinaryIndependenceModel• EstimatingRSVcoefficients intheory• Foreachtermi lookatthistableofdocumentcounts:
Documents
Relevant Non-Relevant Total
xi=1 s n-s n xi=0 S-s N-n-S+s N-n Total S N-S N
Sspi » )(
)(SNsnri -
-»
)()()(log),,,(
sSnNsnsSssSnNKci +---
-=»
• Estimates: For now,assume nozero terms.Remembersmoothing.
)1()1(logii
iii pr
rpc--
=
IntroductiontoInformationRetrieval
Estimation– keychallenge
§ Ifnon-relevantdocumentsareapproximatedbythewholecollection,thenri (prob.ofoccurrenceinnon-relevantdocumentsforquery)isn/Nand
log1− riri
= log N − n− S + sn− s
≈ log N − nn
≈ log Nn= IDF!
IntroductiontoInformationRetrieval
Collectionvs.Documentfrequency
§ Collectionfrequencyoft isthetotal numberofoccurrencesoft inthecollection(incl.multiples)
§ Documentfrequencyisnumberofdocstisin§ Example:
§ Whichwordisabettersearchterm(andshouldgetahigherweight)?
Word Collection frequency Document frequency
insurance 10440 3997
try 10422 8760
Sec. 6.2.1 IntroductiontoInformationRetrieval
Estimation– keychallenge
§ pi (probabilityofoccurrenceinrelevantdocuments)cannotbeapproximatedaseasily
§ pi canbeestimatedinvariousways:§ fromrelevantdocumentsifyouknowsome
§ Relevanceweightingcanbeusedinafeedbackloop
§ constant(CroftandHarpercombinationmatch)– thenjustgetidf weightingofterms(withpi=0.5)
§ proportionaltoprob.ofoccurrenceincollection§ Greiff (SIGIR1998)arguesfor1/3+2/3dfi/N
RSV = log Nnixi=qi=1
∑
IntroductiontoInformationRetrieval
ProbabilisticRelevanceFeedback1. GuessapreliminaryprobabilisticdescriptionofR=1
documents;useittoretrieveasetofdocuments2. Interactwiththeusertorefinethedescription:
learnsomedefinitememberswithR=1andR=03. Re-estimatepi andri onthebasisofthese
§ Ifi appearsinVi withinsetofdocumentsV:pi =|Vi|/|V|§ Orcancombinenewinformationwithoriginalguess(use
Bayesianprior):
4. Repeat,thusgeneratingasuccessionofapproximationstorelevantdocuments
kk++
=|||| )1(
)2(
VpVp ii
iκ is prior
weight
IntroductiontoInformationRetrieval
30
Iterativelyestimatingpi and ri(=Pseudo-relevancefeedback)1. Assumethatpi isconstantoverallxi inqueryandri
asbefore§ pi =0.5(evenodds)foranygivendoc
2. Determineguessofrelevantdocumentset:§ V isfixedsizesetofhighestrankeddocumentsonthis
model3. Weneedtoimproveourguessesforpi andri,so
§ Usedistributionofxi indocsinV.LetVi besetofdocumentscontainingxi§ pi =|Vi|/|V|
§ Assumeifnotretrievedthennotrelevant§ ri =(ni – |Vi|)/(N– |V|)
4. Goto2.untilconvergesthenreturnranking
5/21/17
6
IntroductiontoInformationRetrieval
PRPandBIM
§ Gettingreasonableapproximationsofprobabilitiesispossible.
§ Requiresrestrictiveassumptions:§ Termindependence§ Termsnotinquerydon’taffecttheoutcome§ Booleanrepresentationofdocuments/queries/relevance
§ Documentrelevancevaluesareindependent§ Someoftheseassumptionscanberemoved§ Problem:eitherrequirepartialrelevanceinformationor
seeminglyonlycanderivesomewhatinferiortermweights
IntroductiontoInformationRetrieval
Removingtermindependence§ Ingeneral,indextermsaren’t
independent§ Dependenciescanbecomplex§ vanRijsbergen (1979)proposed
simplemodelofdependenciesasatree§ ExactlyFriedmanand
Goldszmidt’s TreeAugmentedNaiveBayes(AAAI13,1996)
§ Eachtermdependentononeother
§ In1970s,estimationproblemsheldbacksuccessofthismodel
IntroductiontoInformationRetrieval
Secondstep:Termfrequency§ Rightinthefirstlecture,wesaidthatapageshouldrankhigherifitmentionsawordmore§ Perhapsmodulatedbythingslikepagelength
§ Wemightwantamodelwithtermfrequencyinit.
§ We’llseeaprobabilisticonenexttime– BM25
§ Quicksummaryofvectorspacemodel
IntroductiontoInformationRetrieval
Summary– vectorspaceranking
§ Representthequeryasaweightedtermfrequency/inversedocumentfrequency(tf-idf)vector
§ Representeachdocumentasaweightedtf-idf vector§ Computethecosinesimilarityscoreforthequeryvectorandeachdocumentvector
§ Rankdocumentswithrespecttothequerybyscore§ ReturnthetopK (e.g.,K =10)totheuser
IntroductiontoInformationRetrieval IntroductiontoInformationRetrieval
Cosinesimilarity
Sec. 6.3
5/21/17
7
IntroductiontoInformationRetrieval
tf-idfweightinghasmanyvariants
Sec. 6.4 IntroductiontoInformationRetrieval
ResourcesS.E.RobertsonandK.Spärck Jones.1976.RelevanceWeightingofSearch
Terms.JournaloftheAmericanSocietyforInformationSciences27(3):129–146.
C.J.vanRijsbergen.1979.InformationRetrieval. 2nded.London:Butterworths,chapter6.[Mostdetailsofmath]http://www.dcs.gla.ac.uk/Keith/Preface.html
N.Fuhr.1992.ProbabilisticModelsinInformationRetrieval.TheComputerJournal,35(3),243–255.[Easiestread,withBNs]
F.Crestani,M.Lalmas,C.J.vanRijsbergen,andI.Campbell.1998.IsThisDocumentRelevant?…Probably:ASurveyofProbabilisticModelsinInformationRetrieval.ACMComputingSurveys 30(4):528–552.http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/[Addsverylittlematerialthatisn’tinvanRijsbergen orFuhr ]