Offline Testing Search Engine Results
Uploaded by: stephane-bottine
Category: Data & Analytics
the experiment
transition from FAST to Solr
My client was transitioning its site search engine from FAST to Solr and wanted the new search engine to return matching search results. This is unlike most experiments, which involve some form of optimization.
testing methodology
Offline testing doesn’t happen in isolation and is often a precursor to A/B testing.
• A/B testing: large-scale, quantitative user testing
• user testing: small-scale, qualitative testing
• offline testing: focus on differences in search results
• performance testing: focus on speed and queries per second
• regression testing: is anything broken?
technology stack
[Diagram: real user queries are sent to FAST and to Solr; the results from each engine feed into the analysis]
We ran thousands of real user queries against each search engine and saved the results for analysis.
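The collection step described on this slide could be sketched as follows. This is a minimal illustration, not the harness used in the project: `collect_results`, `load_results` and the two search callables are hypothetical names, and each callable is assumed to take a query string and return an ordered list of result identifiers.

```python
import json
from typing import Callable, Iterable, List

# Hypothetical signature: query string in, ordered result identifiers out.
SearchFn = Callable[[str], List[str]]


def collect_results(queries: Iterable[str], fast_search: SearchFn,
                    solr_search: SearchFn, out_path: str) -> int:
    """Run every query against both engines and save one JSON record per line."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for query in queries:
            record = {
                "query": query,
                "fast": fast_search(query),
                "solr": solr_search(query),
            }
            out.write(json.dumps(record) + "\n")
            written += 1
    return written


def load_results(path: str) -> List[dict]:
    """Read the saved records back for analysis."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Saving the raw results once means every metric can later be recomputed without querying either engine again.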
offline testing metrics
Your choice of metrics will depend on your goals and the availability of information.
• differences in result counts: use either the mean absolute difference in results, also known as mean absolute error (MAE), or root mean squared error (RMSE) to measure differences in counts. It is helpful to express this metric in relative terms, as a percentage of the average number of results returned by the existing search engine.
• how many results overlap: use the Jaccard index or the Sørensen-Dice index to measure similarity across sets of results.
• rank correlation: use Spearman’s rank correlation coefficient (Spearman’s rho) to measure correlation across overlapping results.
offline testing metrics (continued)
• precision and recall: precision and recall are often used to measure the quality of information retrieval systems. When comparing engines offline, these metrics implicitly assume that the current set of results is a “gold standard”, and they are best suited to cases where results are sorted by relevance.
• click metrics: if you have a record of user interactions with your existing set of search results, you could forecast click metrics for the new search engine. Common metrics include the average number of clicks per query, as well as the average (mean) click rank.
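As a rough sketch of the precision and recall bullet, the function below treats the FAST results for a query as the gold standard and scores the Solr results against them. The function name is hypothetical, and returning 0.0 when a result set is empty is an assumed convention rather than something the slides specify.

```python
from typing import List, Tuple


def precision_recall(fast_results: List[str],
                     solr_results: List[str]) -> Tuple[float, float]:
    """Precision and recall of Solr results, with FAST results as the gold standard."""
    gold = set(fast_results)
    retrieved = set(solr_results)
    hits = len(gold & retrieved)
    # Assumed convention: an empty set yields 0.0 instead of a division error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

With FAST results A to J and Solr results B, W, C, X, D, Y, Z, three results are shared, so precision is 3/7 and recall is 3/10.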
analysis framework
[Scatter plot: # FAST results on one axis against # Solr results on the other, with the reference line Solr = FAST; every point represents a query]
Our clients find that a scatter plot is a helpful way to visualize differences in counts.
In an ideal world, every query would lie on the straight line Solr = FAST passing through the origin (indicative of a perfect match in counts between the old and the new engine).
However, bugs and differences in indexation can force points away from that line and onto either the X or the Y axis.
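Points that land on an axis are queries for which one engine returned nothing, which suggests grouping queries into four quadrants by whether each engine returned any results. The sketch below shows one way to do that grouping; the function names and label strings are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def quadrant(fast_count: int, solr_count: int) -> str:
    """Label a query by whether each engine returned any results."""
    fast_part = "FAST > 0" if fast_count > 0 else "FAST = 0"
    solr_part = "Solr > 0" if solr_count > 0 else "Solr = 0"
    return f"{fast_part}, {solr_part}"


def quadrant_shares(counts: Iterable[Tuple[int, int]]) -> Dict[str, Tuple[int, float]]:
    """Map each quadrant to (query count, share of all queries)."""
    tally = Counter(quadrant(f, s) for f, s in counts)
    total = sum(tally.values())
    return {name: (n, n / total) for name, n in tally.items()}
```

Each input pair is (number of FAST results, number of Solr results) for one query.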
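differences in counts
Example result lists for one query:
FAST results: A B C D E F G H I J
Solr results: B W C X D Y Z

Across query i: F_i is the number of FAST results for query i, and S_i is the number of Solr results for query i.

Across all queries (n is the number of queries):

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (F_i - S_i)^2}{n}}$$

The Root Mean Squared Error measures the average difference in the number of search results found.

The Coefficient of Variation expresses the RMSE in relative terms:

$$\mathrm{CV} = \frac{\mathrm{RMSE}}{\text{Avg. \# FAST results}}$$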
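The count metrics can be computed from two parallel lists of per-query result counts. This is a minimal sketch with hypothetical function names; returning `None` when FAST returns no results on average is an assumption, made because the ratio is undefined in that case.

```python
import math
from typing import Optional, Sequence


def _check(fast_counts: Sequence[int], solr_counts: Sequence[int]) -> None:
    if len(fast_counts) != len(solr_counts) or not fast_counts:
        raise ValueError("need two non-empty lists of equal length")


def mae(fast_counts: Sequence[int], solr_counts: Sequence[int]) -> float:
    """Mean absolute difference in result counts."""
    _check(fast_counts, solr_counts)
    return sum(abs(f - s) for f, s in zip(fast_counts, solr_counts)) / len(fast_counts)


def rmse(fast_counts: Sequence[int], solr_counts: Sequence[int]) -> float:
    """Root mean squared difference in result counts."""
    _check(fast_counts, solr_counts)
    squared = [(f - s) ** 2 for f, s in zip(fast_counts, solr_counts)]
    return math.sqrt(sum(squared) / len(squared))


def coefficient_of_variation(fast_counts: Sequence[int],
                             solr_counts: Sequence[int]) -> Optional[float]:
    """RMSE divided by the average number of FAST results."""
    error = rmse(fast_counts, solr_counts)
    mean_fast = sum(fast_counts) / len(fast_counts)
    if mean_fast == 0:
        return None  # assumption: undefined when FAST returns nothing on average
    return error / mean_fast
```

Because the differences are squared before averaging, a few queries with very large gaps can dominate the RMSE, which is worth keeping in mind when reading it alongside the MAE.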
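overlap
Example result lists for one query:
FAST results: A B C D E F G H I J
Solr results: B W C X D Y Z

You could use the Sørensen-Dice index to measure the similarity of sets for each query. It is bounded between 0 and 1 (1 is desirable and indicative of perfect overlap).

Across query i:

$$\mathrm{Sim}_i(F_i, S_i) = \frac{2\,|F_i \cap S_i|}{|F_i| + |S_i|}$$

F_i is the set of FAST results for query i at rank n, and S_i is the set of Solr results for query i at rank n (that is, the first n results from each engine).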
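Both overlap indices reduce to a few set operations. The sketch below uses hypothetical function names and assumes that two empty result sets count as a perfect match (1.0), a convention the slides do not state.

```python
from typing import Iterable


def dice_index(fast_results: Iterable[str], solr_results: Iterable[str]) -> float:
    """Sørensen-Dice index: 2 * |F ∩ S| / (|F| + |S|)."""
    f, s = set(fast_results), set(solr_results)
    if not f and not s:
        return 1.0  # assumption: two empty sets are treated as identical
    return 2 * len(f & s) / (len(f) + len(s))


def jaccard_index(fast_results: Iterable[str], solr_results: Iterable[str]) -> float:
    """Jaccard index: |F ∩ S| / |F ∪ S|."""
    f, s = set(fast_results), set(solr_results)
    if not f and not s:
        return 1.0  # same assumption as above
    return len(f & s) / len(f | s)
```

For the example lists on this slide, the shared results are B, C and D, so the Sørensen-Dice index is 2 × 3 / (10 + 7) ≈ 0.35 and the Jaccard index is 3 / 14 ≈ 0.21.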
rank correlation
Example result lists for one query:
FAST results: A B C D E F G H I J
Solr results: B W C X D Y Z

You could use Spearman’s rank correlation coefficient to calculate correlations across overlapping results. This metric is bounded between -1 and 1 (1 is desirable and indicative of perfect positive correlation).

Across query i:

$$\rho_i = 1 - \frac{6 \sum_{j=1}^{n} d_j^2}{n(n^2 - 1)}$$

d_j is the difference in ranks for the j-th result and is calculated as FAST rank_j - Solr rank_j.
n is the number of overlapping results for query i.
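The formula can be applied per query as sketched below. One point is an assumption rather than something the slides state: the overlapping results are re-ranked from 1 to n within each engine's list before taking differences, which keeps the result between -1 and 1. The function name is hypothetical, result identifiers are assumed to be unique within each list, and `None` is returned when fewer than two results overlap, since the formula is undefined there.

```python
from typing import Optional, Sequence


def spearman_rho(fast_results: Sequence[str],
                 solr_results: Sequence[str]) -> Optional[float]:
    """Spearman's rho across the results that both engines returned."""
    fast_set, solr_set = set(fast_results), set(solr_results)
    # Overlapping results, kept in each engine's own order.
    fast_order = [r for r in fast_results if r in solr_set]
    solr_order = [r for r in solr_results if r in fast_set]
    n = len(fast_order)
    if n < 2:
        return None  # undefined for fewer than two overlapping results
    # Assumption: ranks are reassigned 1..n within the overlapping subset.
    solr_rank = {r: k for k, r in enumerate(solr_order, start=1)}
    sum_d2 = sum((k - solr_rank[r]) ** 2
                 for k, r in enumerate(fast_order, start=1))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
```

For the example lists on this slide, B, C and D overlap and appear in the same relative order in both engines, so the coefficient under this re-ranking is 1.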
differences in counts (before)

Quadrant              Query Count   Query Share (%)   RMSE        CV (%)
FAST > 0, Solr > 0    7,049         82%               1,749,463   8,588%
FAST > 0, Solr = 0    388           5%                1,718       403%
FAST = 0, Solr > 0    107           1%                7,078,404   NA
FAST = 0, Solr = 0    1,037         12%               0           NA
Overall               8,581         100%              1,771,711   10,575%

Query type             Query Count   Query Share (%)   RMSE     CV (%)
Wildcard queries       690           8%                99,894   298%
Loose phrase queries   925           11%               15,905   188%
Plural-form queries    1,090         13%               13,845   207%

Automated insights by query type. Note that a query may be associated with more than one type.
Differences in counts are driven by a handful of queries with 0 or few results in FAST and millions in Solr.
differences in counts (after)

Quadrant              Query Count   Query Share (%)   RMSE    CV (%)
FAST > 0, Solr > 0    5,166         61%               213     2%
FAST > 0, Solr = 0    125           1%                4,332   1,023%
FAST = 0, Solr > 0    20            0%                131     NA
FAST = 0, Solr = 0    3,169         37%               0       NA
Overall               8,480         100%              418     28%

Query type             Query Count   Query Share (%)   RMSE    CV (%)
Wildcard queries       665           8%                159     4%
Loose phrase queries   910           11%               137     19%
Plural-form queries    2,356         28%               1,009   69%

After several iterations, differences in counts are down to 2% on average across queries that return results in both FAST and Solr. However, there is still more work to be done.
overlap (before)

Result Counts*   Result Overlap   Jaccard Index   Sørensen-Dice Index
5+               1                0.21            0.26
10+              3                0.23            0.30
20+              7                0.26            0.34

* Measured across the first 5, 10 and 20 results for queries that returned 5+, 10+ and 20+ results respectively.
On average, just 7 of the first 20 results (and 1 of the first 5 results) overlapped at the start of the process.
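The cutoff-based averaging described in the footnote could be sketched as below. The footnote does not say which engine must have returned at least k results, so requiring it of both engines is an assumption here, and `mean_overlap_at` is a hypothetical name.

```python
from statistics import mean
from typing import Iterable, Optional, Sequence, Tuple


def mean_overlap_at(pairs: Iterable[Tuple[Sequence[str], Sequence[str]]],
                    k: int) -> Optional[float]:
    """Average number of shared results among the first k, over eligible queries."""
    overlaps = [
        len(set(fast[:k]) & set(solr[:k]))
        for fast, solr in pairs
        # Assumption: a query is eligible only if both engines returned k+ results.
        if len(fast) >= k and len(solr) >= k
    ]
    return mean(overlaps) if overlaps else None
```

Each pair holds the ordered FAST and Solr results for one query; calling the function with k set to 5, 10 and 20 would give one average per row of the table.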
overlap (after)

Result Counts*   Result Overlap   Jaccard Index   Sørensen-Dice Index
5+               5                0.89            0.92
10+              9                0.90            0.93
20+              19               0.91            0.94

* Measured across the first 5, 10 and 20 results for queries that returned 5+, 10+ and 20+ results respectively.
After several iterations, overlap has risen to 19 results on page 1, up from 7 at the start of the process. Crucially, 5 of the first 5 results overlap on average.
rank correlation (before)

Result Counts*   Spearman’s Rank (date desc.)
5+               0.97
10+              0.96
20+              0.96

* Measured across the first 5, 10 and 20 results for queries that returned 5+, 10+ and 20+ results respectively.
Rank correlation was particularly strong from the outset, at 0.96 out of a maximum of 1, as almost all searches are sorted by date rather than by relevance.
rank correlation (after)

Result Counts*   Spearman’s Rank (date desc.)
5+               0.98
10+              0.98
20+              0.99

* Measured across the first 5, 10 and 20 results for queries that returned 5+, 10+ and 20+ results respectively.
After several iterations, rank correlation across overlapping results has improved further to 0.99 out of a maximum of 1.
This presentation illustrated how you could test search engine results offline. However, as every implementation is unique, please contact us to discuss your needs.