

Time Well Spent

Charles L. A. Clarke
School of Computer Science
University of Waterloo, Canada
[email protected]

Mark D. Smucker
Department of Management Sciences
University of Waterloo, Canada
[email protected]

ABSTRACT

Time-biased gain provides a general framework for predicting user performance on information retrieval systems, capturing the impact of the user's interaction with the system's interface. Our prior work investigated an instantiation of time-biased gain aimed at traditional search interfaces utilizing clickable result summaries, with gain realized from the recognition of relevant documents. In this paper, we examine additional properties of time-biased gain, demonstrating how it generalizes effectiveness measures from across the field of information retrieval. We explore a new instantiation of time-biased gain, applicable to systems where the user judges the quality of their experience by the amount of time well spent. Rather than the single number produced by traditional effectiveness measures, time-biased gain models user variability and produces a distribution of gain on a per-query basis. With this distribution, we can observe performance differences at the user level. We apply bootstrap sampling to estimate confidence intervals across multiple queries.

Categories and Subject Descriptors

H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness)

Keywords

Search, evaluation

1. INTRODUCTION

Progress in information retrieval requires us to quantify the performance of search engines and other information access systems. Given two systems — perhaps the same system before and after some proposed improvement — we might compare their performance in a production environment through a straightforward A/B test. We assign incoming users to one system or the other at random, and their preference for one system over the other is inferred from clickthroughs, dwell times, and other implicit feedback [9, 18, 22]. Alternatively, we might conduct controlled experiments in a laboratory setting, where user preferences and performance can be measured by more direct methods, including think-aloud protocols and questionnaires [18].

Unfortunately, such user studies require substantial time, effort, and money. For many tasks — such as tuning or learning a ranker — it must be possible to compute performance measures rapidly and repeatedly. Since this requirement precludes user studies — even when the costs could otherwise be tolerated — the computation of these performance measures must be entirely automatic.

Traditionally, information retrieval researchers have used a Cranfield-style evaluation framework to achieve this goal, particularly as a method for quantifying the performance of document ranking algorithms [1, 4, 18, 24, 30]. A classic Cranfield-style test collection comprises a corpus of documents, a set of queries, and a set of relevance judgments relating the two. To test a system, a researcher generates a document ranking for each query, and applies any number of standard and proposed effectiveness measures to quantify system performance on that query.

Such standard effectiveness measures include precision@k, normalized discounted cumulative gain [17], average precision [24], rank-biased precision [20], and expected reciprocal rank [9]. Averaging these measures over the set of queries provides summary measures of effectiveness [23]. Statistical testing provides a method for comparing systems under these measures [8, 26], with t-tests and bootstrapping widely employed for this purpose.

For an effectiveness measure to be meaningful, we must have reasonable confidence that an increase in the measure would translate to an improved user experience, assuming the system were deployed in a production environment. Ideally, the effectiveness measure would also provide an indication of the magnitude of the improvement, so that effect sizes could be considered in statistical testing. For example, an effectiveness measure might indicate how many more relevant documents we expect the user to see with System A, as compared to System B. A statistically significant increase of 2.00 documents may have more practical significance than a statistically significant increase of 0.02 documents.

In a series of recent papers [27–29], we defined and developed time-biased gain (TBG), a unified effectiveness measure for information retrieval evaluation. TBG simulates a population of users interacting with an information access system, computing both the benefits obtained by the users and the time taken to achieve those benefits. By building on an explicit interaction model, TBG can report performance in meaningful units, such as the number of relevant documents seen by the user. By accounting for user effort in terms of time, TBG can appropriately reflect the impact of captions, clicks, and other interface components. Finally, by simulating a population of users, TBG generates a per-query distribution of performance values, rather than the single number produced by traditional effectiveness measures. Using this per-query distribution, TBG facilitates the computation of confidence intervals and effect sizes.

In our prior work, we investigated an instantiation of time-biased gain aimed at a traditional search interface, which utilizes a ranked list of clickable result summaries. We reported performance in terms of the number of relevant documents identified by the user. Building on this work, Sakai and Dou [25] present an evaluation framework for search interfaces beyond the ranked list.

As the central concept in their framework, Sakai and Dou introduce the notion of a trailtext. Given a user interacting with an information access system, a trailtext is formed by concatenating all text seen by the user during the course of their interaction. They introduce a new evaluation measure, called U-measure, which is computed over a trailtext based on the relevant content appearing within it, discounted by the positions where this relevant content appears. Sakai and Dou describe how their framework may be applied to evaluate passage retrieval, summarization, question answering, and multi-query search sessions, as well as traditional document ranking. Since trailtexts may be generated from both user studies and user simulations, trailtexts provide a unified method for quantifying system performance.

Unfortunately, trailtexts do not appropriately account for user variability and user effort. For example, under a traditional search interface, the time required for a user to consider a caption and click on the link differs from the time required for that user to read the linked document and make a relevance decision. Users browse and read at different speeds, and these differing times are not accurately captured by the relative lengths of the captions and documents seen by the user. Moreover, trailtexts ignore the impact of system components and non-textual information, and trailtexts cannot be generated for video, images, or other multimedia data.

In this paper, we introduce a new instantiation of time-biased gain applicable to systems where the user views the quality of their experience by the amount of time well spent (TWS). We illustrate TWS by applying it to the evaluation of passage retrieval and employ as an exemplar the TREC 2004 HARD Track [1]. This track explored passage retrieval as a method to save user effort as compared to document retrieval. Instead of documents, systems returned passages that hopefully consisted of only relevant material. By directly estimating the amount of time well spent, we can make more meaningful comparisons between systems than existing passage effectiveness measures allow. By generalizing the notion of a trailtext to a time-oriented trace of user activity, our TWS measure can encompass a broader domain of application, while also accounting for user variability and effort. Beyond examining user variability at the query level, this paper also contributes a new user-focused methodology for summarizing performance across a set of queries and for statistically comparing performance between pairs of systems, designed for effectiveness measures, such as TWS, that model user variability and produce a distribution of performance scores on a per-query basis.

2. TIME-BIASED GAIN

The most general form of time-biased gain imagines a user interacting with an information access system — skimming pages, viewing videos, reading text, or otherwise consuming content. From time to time, the user encounters material that generates some benefit for them, or gain, perhaps by providing entertainment or satisfying some other information need. Expected gain over the lifetime of the interaction is computed by the equation

\[ \frac{1}{N} \int_{0}^{\infty} D(t)\, dG(t). \tag{1} \]

In this equation, G(t) represents a cumulative gain function, parameterized by time t, with G(t) increasing monotonically as t → ∞. We discuss the decay function D(t) and the normalization factor, N, in detail later in the paper. The cumulative gain can be expressed in any reasonable unit, as appropriate to the interface and task. For example, a traditional search engine might respond to a query with a list of summaries and links to full documents. The user scans the summaries, clicks on links that look interesting, and reads the linked documents. In our prior work [27–29], we explored an instantiation of TBG for this traditional search interface, expressing gain in terms of the number of relevant documents recognized by the user.
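To make the computation concrete, here is a minimal sketch (our own illustration, not code from the paper) that evaluates Equation 1 when gain arrives in discrete increments, so the integral collapses to a sum of gain increments weighted by the decay at the time each increment is realized. The exponential decay and its half-life value are assumptions chosen for the example.

```python
import math

def time_biased_gain(gain_events, half_life=224.0, N=1.0):
    """Evaluate Equation 1 for a step-wise cumulative gain function G(t).

    gain_events: list of (t, dg) pairs, where dg is the gain increment
        realized at time t (seconds). With step-function gain, the integral
        in Equation 1 reduces to the sum of dg * D(t) over these events.
    half_life: half-life (seconds) of an assumed exponential decay
        D(t) = exp(-t ln2 / half_life); the default is illustrative only.
    N: normalization constant (1 when gain is in meaningful units).
    """
    def D(t):
        return math.exp(-t * math.log(2) / half_life)
    return sum(dg * D(t) for t, dg in gain_events) / N

# Example: a simulated user recognizes one relevant document at 30 s,
# 95 s, and 260 s into their interaction.
print(time_biased_gain([(30.0, 1.0), (95.0, 1.0), (260.0, 1.0)]))
```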

2.1 Cumulative gain

The details for expressing cumulative gain depend on the details of the interface, as well as on the needs and actions of the user. In this paper, we focus on a simple but general measure of cumulative gain, which we call time well spent (TWS). For a text-oriented system, we might express TWS as the total time spent reading relevant material vs. the total time spent interacting with the system. Equivalently, for a video retrieval system, we might express TWS as the total time spent viewing entertaining material vs. the total time spent. If we consider some relevant material to be superior to other relevant material, e.g., if we consider reading a highly relevant document to be preferable to reading a marginally relevant document, time might be better spent on the more relevant material, and gain might accumulate faster over it.

2.2 Decay

Just as the function G(t) represents the benefit received by the user for an investment of time t, the decay function D(t) represents the user's willingness to invest this time. The value D(t) provides the survival probability at time t, indicating whether the user continues working until that time, with D(t) decreasing monotonically to 0 as t → ∞. In our prior work, we adopted an exponential decay function. We supported this choice through an analysis of queries and clicks mined from the logs of a commercial search engine. As we show in Section 5.3, other choices for D(t) may be reasonable.

TBG assumes that G(t) and D(t) are independent. We recognize that the user's willingness to invest their time might depend on factors such as the rate that relevant material is seen, but these factors can be accommodated in G(t). If a user decides that further interaction with the system would not be worth their while, G(t) stops increasing, indicating that the user's activities have moved outside of the scope of the system under investigation. Even if they have additional time to invest, the system is not worth their investment. In this circumstance, we imagine the user attempting to satisfy their needs in some other way, rather than just giving up. Our mapping of the expected reciprocal rank measure onto TBG in Section 3 provides an example of this type of cumulative gain.

In a sense, G(t) captures factors that are intrinsic to the system. If a user were willing to invest unlimited time, G(t) describes the benefit they would receive from the system over time. On the other hand, D(t) captures factors that are extrinsic to the system. The user only has so much time to spend. They cannot spend more, even if they would receive substantial benefit from further interaction.

2.3 Normalization

Equation 1 includes a normalization constant, N > 0. If cumulative gain is expressed in meaningful units, such as the number of relevant documents recognized or time well spent, normalization may not be required, and we set N = 1. In other cases, the normalization constant may be used to map the value of TBG into the range [0 : 1]. In particular, many of the standard evaluation measures discussed in Section 3 require normalization for unification with TBG.

2.4 Simulation

Equation 1 represents expected cumulative gain for a fixed cumulative gain function G(t). To model user variance, in Smucker and Clarke [27, 28], we simulate a user population interacting with a traditional search interface. From this simulation, we generate a large set of cumulative gain functions {G_1(t), G_2(t), ..., G_m(t)}, where m is the number of simulated users, and compute a distribution of expected cumulative gain values over this set. In this paper, we adopt a similar approach, but with some important differences (Section 4).

3. BACKGROUND

Time-biased gain unifies many standard and proposed effectiveness measures from across the field of information retrieval. Mapping these measures onto TBG exposes underlying assumptions about system interfaces and user behavior. By making these assumptions explicit, we may better understand their shortcomings and develop improvements. While we previously recognized the generality of TBG [27–29], we did not explore this generality, which provides important background for the remainder of this paper.

3.1 Traditional effectiveness measures

To map traditional effectiveness measures onto TBG, we begin by assuming that a user has entered a query into a search engine, and that the search engine has responded with a ranked list of documents. We then imagine the user working down the ranked list, considering documents in order, one at a time. Gain is recognized when the user views a relevant document, making cumulative gain a step function and reducing Equation 1 to:

\[ \frac{1}{N} \sum_{k=1}^{\infty} g_k \, D(T(k)), \tag{2} \]

where g_k represents the gain from viewing the document at rank k, and T(k) represents the time to reach rank k. If we assume the user devotes an equal amount of time to each document, we have T(k) = ck, for some constant c > 0. Substituting further reduces the equation to:

\[ \frac{1}{N} \sum_{k=1}^{\infty} g_k \, d_k, \tag{3} \]

where d_k = D(ck), a rank-oriented discount value.

Carterette [5] uses Equation 3 as a starting point for his analysis of the user models underlying a number of standard effectiveness measures. However, as we demonstrate above, this equation already incorporates many unrealistic assumptions about user behavior, i.e., a linear traversal of the ranked list, with equal time spent at each rank, and gain recognized all at once as each document is seen. Depending on the measure, gain may be expressed as a binary value (i.e., a relevant document was seen or not seen) or as a graded value. These graded values may be expressed in arbitrary units [17] or as a probability of relevance [9, 20].

3.2 Mapping

Many standard effectiveness measures map cleanly into Equation 3. For example, precision@k assumes the user always views exactly k documents, counting the number of relevant documents seen and normalizing by N = k. Normalized discounted cumulative gain [17] employs graded gain with a log-harmonic discount, normalized by an ideal value computed from the collection. Rank-biased precision [20] employs either binary or graded gain with a geometric discount, normalizing by an ideal value computed under the assumption that the collection contains an unlimited number of relevant documents.

Expected reciprocal rank [9] assumes the user is seeking a single relevant document. In this measure, g_k represents the probability that the user will reach rank k, recognize the document as relevant, and stop. The value for g_k is computed from individual document relevance information through a simple cascade browsing model. As its name suggests, the measure employs a discount of d_k = 1/k and no normalization, returning the reciprocal of the expected rank at which the user finds a relevant document.
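To illustrate how these mappings work in practice, the following sketch (ours, with made-up relevance values and parameter settings) computes precision@k, rank-biased precision, and expected reciprocal rank by plugging different choices of gain g_k, discount d_k, and normalization N into the single sum of Equation 3.

```python
def eq3(gains, discounts, N=1.0):
    """Equation 3: (1/N) * sum_k g_k * d_k."""
    return sum(g * d for g, d in zip(gains, discounts)) / N

rels = [1, 0, 1, 1, 0]          # illustrative binary relevance by rank

# precision@k: unit gain for relevant documents, no discount, N = k.
k = 5
p_at_k = eq3(rels[:k], [1.0] * k, N=k)

# Rank-biased precision with patience p: geometric discount p^(k-1),
# normalized by the ideal value 1 / (1 - p).
p = 0.8
rbp = eq3(rels, [p ** (i - 1) for i in range(1, len(rels) + 1)], N=1.0 / (1.0 - p))

# Expected reciprocal rank: the cascade model turns per-rank relevance
# probabilities R_k into the probability of stopping at rank k (the gain),
# the discount is 1/k, and there is no normalization.
R = [0.5, 0.0, 0.5, 0.5, 0.0]   # illustrative relevance probabilities
stop_probs, p_reach = [], 1.0
for r in R:
    stop_probs.append(p_reach * r)
    p_reach *= 1.0 - r
err = eq3(stop_probs, [1.0 / k for k in range(1, len(R) + 1)])

print(p_at_k, rbp, err)
```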

Using Equation 3 as the starting point, Carterette [5] analyses these standard effectiveness measures in terms of their underlying assumptions about user behavior, reaching more detailed conclusions about the interpretation of their specific gain and discount values. Building on this research, Carterette et al. [6] conduct simulations of simple user behavior, adopting the specific user behavior model underlying rank-biased precision, and characterizing users strictly in terms of the patience parameter incorporated into that measure. In follow-up work, Carterette et al. [7] present methods for estimating a distribution for this parameter by mining the logs of a commercial search engine. Time-biased gain may be viewed as an extension of these ideas, with time used as a method to appropriately account for user effort and, in the current paper, as a method for expressing gain.

Mapping average precision [24] into TBG provides interesting insights into the implicit assumptions underlying it. Mathematically, average precision maps cleanly into Equation 3, but the result is difficult to interpret. Average precision is defined as

\[ \frac{1}{R} \sum_{k=1}^{\infty} \left( r_k \cdot \text{precision@}k \right) \;=\; \frac{1}{R} \sum_{k=1}^{\infty} \frac{r_k}{k} \sum_{j=1}^{k} r_j, \tag{4} \]

where R is the number of relevant documents in the collection, and r_k is the binary relevance of the document at rank k. To map Equation 4 into Equation 3 we could set N = R, d_k = 1/k, and g_k = r_k \sum_{j=1}^{k} r_j. Guided by Robertson [24], we might imagine a user who randomly picks an integer n in the range [1 : R] as the number of relevant documents they would like to see, somehow knowing or guessing the number of relevant documents in the collection. They proceed down the list until they have seen n relevant documents, and at that point receive a gain of n. Average precision can then be interpreted as the user's expected discounted gain.
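The algebra can be checked numerically. The sketch below (ours, using a made-up relevance vector and assuming, for simplicity, that all R relevant documents appear in the ranking) computes average precision directly and again through Equation 3 with N = R, d_k = 1/k, and g_k equal to r_k multiplied by the number of relevant documents seen up to rank k.

```python
rels = [1, 0, 1, 0, 0, 1]     # illustrative binary relevance by rank
R = sum(rels)                 # here we assume all relevant docs are retrieved

# Direct definition: mean of precision@k over the ranks of relevant documents.
ap_direct = sum(sum(rels[:k]) / k for k, r in enumerate(rels, 1) if r) / R

# Equation 3 form: N = R, d_k = 1/k, g_k = r_k * (relevant docs seen up to k).
ap_eq3 = sum(r * sum(rels[:k]) * (1.0 / k) for k, r in enumerate(rels, 1)) / R

assert abs(ap_direct - ap_eq3) < 1e-12
print(ap_direct, ap_eq3)
```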

3.3 Trailtexts and the U-measure

Sakai and Dou's [25] trailtexts and their associated U-measure provided inspiration for our TWS measure. Given a trailtext of length l, the U-measure is computed as

\[ \frac{1}{N} \sum_{k=1}^{l} g_k \, d_k. \tag{5} \]

Just as Equation 3 sums over documents in the ranked list, the U-measure sums over character offsets in the trailtext. The value g_k represents the graded relevance value of the text at offset k, and the value d_k represents a position-oriented discount. For their experiments, Sakai and Dou use a linear discount and no normalization, but clearly the measure could be generalized to other discount and normalization methods.
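A minimal sketch of the computation (ours, written from the description above rather than from Sakai and Dou's implementation) treats a trailtext as a sequence of per-character relevance values and applies a linear discount with a cutoff L; the particular cutoff value, the discount form, and the relevance values are assumptions for the example.

```python
def u_measure(trail_rel, L=1000, N=1.0):
    """U-measure over a trailtext (Equation 5 with a linear discount).

    trail_rel: per-character graded relevance values g_k, in reading order.
    L: cutoff for the assumed linear discount d_k = max(0, 1 - k/L).
    """
    return sum(g * max(0.0, 1.0 - k / L)
               for k, g in enumerate(trail_rel, 1)) / N

# Example: a trailtext with 300 relevant characters followed by
# 500 non-relevant characters.
print(u_measure([1.0] * 300 + [0.0] * 500))
```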

Unlike Equation 3, the model of the user underlying Equation 5 does not assume that the user works linearly through a returned result. Even under a traditional search interface, the user might read a caption, skip ahead to the next caption, re-read the previous caption, click on the link, read the document for a while, return to the results, and so on, generating a trailtext as they go. Nonetheless, Equation 5 shares with traditional evaluation measures many of the same assumptions about user behavior, specifically that all users read at the same fixed rate and that overhead associated with the user interface can be ignored.

3.4 Other effectiveness measures

Although most traditional effectiveness measures assume a ranked list of documents, efforts have been made to develop effectiveness measures to evaluate passage retrieval and other focused retrieval tasks. For example, the INEX series of experiments [19, 21] studies XML retrieval, where the goal is to return a ranked list of document components, such as paragraphs, sections, subsections, etc. Over the years, the INEX organizers have struggled to develop effectiveness measures appropriate to this task. Similar struggles were reported by the organizers of the TREC 2004 HARD Track [1], as detailed in Section 5.4. While we concentrate on the HARD Track in this paper, we believe that TWS can form an appropriate basis for the evaluation of other focused tasks, which we leave for future work.

For video and XML retrieval, de Vries et al. [12] suggest an effectiveness measure based on the user's willingness to tolerate non-relevant material, which foreshadows many of the ideas underlying TBG.

Although we omit the details, measures proposed to evaluate novelty and diversity [10] fit cleanly into the TBG framework. We note, however, that not every effectiveness measure fits; see, for example, the absence time measure of Dupret and Lalmas [14].

Similar to Sakai and Dou’s trailtexts [25], Azzopardi [2]earlier demonstrated the idea of evaluation being based onthe stream of documents examined by a user. Baskaya etal. [3] have shown the importance of measuring gain overtime and avoiding normalization of gain curves. Many oth-ers have also worked to improve IR evaluation through themodeling of user behavior. The MUBE 2013 workshop re-port cites a large number of the works in this area [11].

4. TIME WELL SPENT

To compute time well spent, we assume a simulated user can be described by parameters θ drawn from a distribution over a parameter space Θ. After drawing a sample value for θ, we simulate this user interacting with the system, generating a trace of their activities, until they stop at time s. We represent this stopping time by a random variable S with a distribution derived from the survival probability, f_S(t) = −D′(t).

A trace is our analogue of a trailtext as defined by Sakai and Dou [25], but expressed in terms of time, rather than characters, facilitating a user-oriented analysis. By repeatedly drawing samples from both distributions and simulating user interaction, we can generate an arbitrarily large set of traces. Each trace can be analyzed to determine the user's time well spent. For the purposes of this paper, we assume that time is either well spent or wasted — our analogue of binary relevance. We leave considerations of time better spent — our analogue of graded relevance — for future work. The final output from the simulation is a set of pairs {(s_1, w_1), (s_2, w_2), ..., (s_m, w_m)}, one for each of m simulated users, where in each pair s_i indicates the stopping time for trace i, and w_i indicates the time well spent. We also define temporal precision as w_i/s_i.
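The overall loop is straightforward; the sketch below (ours) shows the shape of the simulation and its output. The helper functions are hypothetical placeholders for the system-specific pieces, and the log-normal parameters anticipate the fits reported in Sections 5.2 and 5.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_tws(sample_theta, sample_stop, time_well_spent_until, m=10_000):
    """Generate m simulated users and their (stopping time, time well spent) pairs.

    sample_theta():  draws user parameters theta (e.g., a reading speed).
    sample_stop():   draws a stopping time s from the distribution f_S(t) = -D'(t).
    time_well_spent_until(theta, s): walks the user's trace and returns the
        seconds of relevant material consumed by time s.
    """
    pairs = []
    for _ in range(m):
        theta = sample_theta()
        s = sample_stop()
        pairs.append((s, time_well_spent_until(theta, s)))
    return pairs

# Hypothetical stand-ins so the sketch runs end to end; the trace model here
# is a toy in which 40% of reading time happens to be well spent.
pairs = simulate_tws(
    sample_theta=lambda: rng.lognormal(mean=1.29, sigma=0.558),  # reading speed
    sample_stop=lambda: rng.lognormal(mean=5.32, sigma=0.965),   # stopping time
    time_well_spent_until=lambda theta, s: 0.4 * s,
)
temporal_precision = [w / s for s, w in pairs]
```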

It is important to recognize that these pairs are not “samples” in the statistical sense. Taken together, the distributions for Θ and S, along with the simulation of the system itself, define distributions for time well spent and temporal precision. Increasing m increases the precision of our numerical approximations for these distributions, but it is not increasing our “sample size” in any sense. The only “sample size” used in this paper is the number of queries (see Section 7).

However — just as Sakai and Dou propose for trailtexts — we could also generate traces from real user interactions, perhaps by running laboratory experiments or mining the logs of a commercial search engine. Such traces would be samples in the statistical sense, requiring statistical methods different from those applied to the experiments reported in the remainder of this paper. We leave the exploration of these ideas for future work.

5. EXEMPLAR

No two users are identical. Users have different interests, they have different abilities, they work at different rates, and they have different amounts of time available. To adapt the general framework of time-biased gain to a new context, we must start with its user interface and model how simulated users interact with this interface. This simulated interaction must include user actions, user effort, and user gain.

As an example, we consider what is perhaps the simplest user interface possible for an information retrieval system, simpler even than the traditional ranked list. In response to a query, the system returns a stream of text to the user. The user interface allows for a single query and no reformulation is possible. Besides accepting a query and returning a stream of text, no other functionality is provided.

Figure 1: Empirical distribution of reading speed (words read per second vs. relative frequency density) and its fit to a log-normal distribution.

Mirroring the simplicity of the user interface, we model interaction with this interface in a very simple manner. The simulated user starts reading the stream of text at time zero. As the user reads, some of the material is relevant, and over that material gain accumulates. Over non-relevant material, gain plateaus. After reading for some time, the simulated user stops reading the stream of text and interaction with the system is finished. The time the user spends reading relevant material is our measure of interest, i.e., time well spent.

Even under this simplest of interfaces, the computation of time well spent requires us to model reading speeds and stopping times, which can be based on real user data. Despite its simplicity, this instantiation of TWS may be applied to a variety of contexts where an information retrieval system attempts to return a focused result, including contexts such as passage retrieval, summarization, and question answering. As a specific example, we turn to the TREC 2004 HARD Track [1], which explored passage retrieval as a method to direct a user's attention to relevant material.

5.1 User interaction model

Cranfield-style evaluation has typically modeled users' different information needs by creating a set of search topics, with each topic typically providing a query and an associated description of an information need. Each topic belongs to a single user, where in TREC parlance, these users are assessors. The assessor who creates the search topic also judges and records the relevance of material with respect to the search topic. While it would be possible to record more information about each assessor, our work here assumes that such information does not exist, which is the norm for TREC test collections. Without knowledge about each assessor's reading rate or time available to work, we must estimate these aspects of user behavior.

We simulate a user population that has both a distribution of reading speeds and a distribution of stopping times. We model both the reading speeds and the stopping times as random variables with log-normal probability distributions. The log-normal distribution commonly provides a good fit to human performance data [13]. Data is well fit by a log-normal distribution when the logs of the data values form a normal distribution.

Traditional effectiveness measures assume users move down a ranked list of material at a constant rate. We could also use a single reading speed and a single time spent reading to compute time well spent, but doing so would greatly limit the benefits to be had from modeling user behavior. For example, by simulating different levels of user patience in the rank-biased precision model of user behavior, Carterette et al. [6] found that different retrieval results would be best for different types of users. In addition to seeing which types of users do better with which sets of retrieval results, Carterette et al. showed that by including the variability of users into our comparisons of systems, we may reach different conclusions about the significance of differences between systems.

5.2 Reading speed

We fit the reading speed distribution using data collected as part of a user study [30]. In this user study, 48 participants individually read and judged the relevance of short document summaries (snippets). Each participant in the user study judged summaries for 4 TREC search topics. For each participant and search topic, we sum the number of words in the summaries judged and divide by the time taken to judge them. Thus, for each user, we obtain 4 different estimates of their reading speed. Taking all 48 × 4 = 192 reading speed estimates as our distribution of reading speed gives us the distribution shown in Figure 1 and its fitted log-normal distribution.

A log-normal distribution can be described in terms of the mean and standard deviation of the normal distribution that fits the natural log of the data. For our reading speed data, a maximum likelihood fit gives us a mean of 1.29 (4.3 words per second) and a standard deviation of 0.558. In a study of 22 college students' reading rates, Hewitt et al. [16] found students to read at an average rate of 3.97 words per second (wps) and minimum and maximum rates of 1.79 and 6.39 wps, respectively. Our distribution of reading speed is in line with the results of Hewitt et al., except that we have several instances of reading speeds significantly greater than the maximum they found. For our experiments on the TREC 2004 HARD Track data, we convert from words per second to characters per second with a conversion ratio of 6.16 characters per word. We determined the average number of characters per word by tokenizing all relevant passages into contiguous tokens of alphanumeric characters to obtain a word count for each passage, and then dividing the number of characters in the original passages by the total word count.
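Fitting a log-normal by maximum likelihood amounts to taking the mean and standard deviation of the logged observations; the sketch below (ours) illustrates the procedure and the words-to-characters conversion, with a short made-up array standing in for the 192 reading-speed estimates.

```python
import numpy as np

# Stand-in for the 192 per-participant, per-topic reading-speed estimates
# (words per second); the real values come from the user study data [30].
speeds_wps = np.array([3.1, 4.4, 2.6, 5.0, 3.9, 6.2, 4.8, 3.3])

log_speeds = np.log(speeds_wps)
mu, sigma = log_speeds.mean(), log_speeds.std()   # MLE fit of a log-normal
mean_wps = np.exp(mu + sigma**2 / 2)              # mean of the fitted log-normal
# The paper reports mu = 1.29 and sigma = 0.558, giving a mean of about 4.3 wps.

# Convert a reading speed to characters per second for the HARD passage data,
# using the paper's ratio of 6.16 characters per word.
chars_per_word = 6.16
mean_cps = mean_wps * chars_per_word
print(mu, sigma, mean_wps, mean_cps)
```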

5.3 Stopping times

For the stopping time distribution, we re-analyzed the data set used in Smucker and Clarke [29], which was mined from the logs of a commercial search engine. As we did previously [29], we restrict our analysis to users who click on five or more results — avoiding most navigational queries — and search for less than half an hour — avoiding extreme outliers. We record the time from first search to last click as the duration of time spent searching.

Figure 2: Empirical distribution of time spent searching (seconds) and its fitted log-normal distribution. Each circle represents the proportion of users who searched for the given number of seconds, i.e., the distribution has 1-second wide bins.

In Smucker and Clarke [29], we fit an exponential function to model the fraction of the user population that continued to search at a given time, i.e., the survival probability D(t). While exponential decay provides an adequate fit to the survival probability, the underlying distribution of the data is better fit by a log-normal. Figure 2 shows the empirical distribution and the maximum likelihood log-normal fit for the time the web users spent searching. The mean and standard deviation of the normal distribution fit to the log of the times, measured in seconds, are 5.32 and 0.965, respectively. To generate a random deviate from a log-normal distribution, we compute exp(µ + σu), where µ and σ are the mean and standard deviation of the normal distribution that is fit to the log of the data, and u is a random deviate drawn from a normal distribution with a mean of 0 and a variance of 1.
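Applying this formula with the fitted parameters is a one-liner; the sketch below (ours) draws a population of stopping times using µ = 5.32 and σ = 0.965 from above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.32, 0.965                 # fit to log of time spent searching (s)

u = rng.standard_normal(10_000)         # u ~ Normal(0, 1)
stop_times = np.exp(mu + sigma * u)     # exp(mu + sigma * u), as described above
# Equivalent shortcut: rng.lognormal(mean=mu, sigma=sigma, size=10_000)
print(stop_times.mean())                # roughly exp(mu + sigma**2 / 2), about 325 s
```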

5.4 TREC 2004 HARD Track

Along with other goals, the TREC 2004 HARD Track [1] explored passage retrieval as a method to direct a user's attention to relevant material. The track included 25 topics for which relevance judging was performed at the passage level. For each judged document, the byte offsets and lengths of relevant passages within the document were identified.

The corpus for the track comprised more than half a million news articles from the year 2003, drawn from a variety of sources, including the Associated Press, the New York Times, and the Washington Post. Systems returned a ranked list of passages for each of the 25 passage-level topics, where each passage consisted of a document, a starting byte offset within the document, and a length in bytes. A given document could appear multiple times in the ranked list, with a different passage each time. Nothing prevented these passages from overlapping and containing duplicate characters.

The track organizers experienced difficulties in their attempts to adapt standard effectiveness measures to passage-oriented retrieval. For example, they defined the measure passage R-precision as the character-level precision up to passage R, where R is the number of relevant passages identified by the TREC assessors. However, as they note, merely splitting existing passages into smaller pieces can dramatically improve this measure.

In view of these difficulties, a passage-oriented variant of the bpref measure [4] was developed for use as the track's primary effectiveness measure [31]. This passage-level bpref measure is computed by concatenating the first 12,000 characters of the returned passages and determining the proportion of non-relevant characters appearing before relevant characters. In essence, this measure assumes that the user reads the passages in order, stopping after they read exactly 12,000 characters, no more and no less. Our instantiation of TWS for this context shares the assumption that the user reads the passages in order, but explicitly considers reading speed and stopping time. In computing TWS over the TREC 2004 HARD Track data, we assume that reading duplicated characters is never time well spent.
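Under these assumptions, one simulated user's time well spent can be computed by walking the concatenated passage stream at that user's reading speed until their stopping time and crediting only relevant, not-yet-duplicated characters. The sketch below is our own simplified illustration: it represents each returned passage by per-character flags, which a real implementation would derive from the HARD byte-offset judgments.

```python
def time_well_spent(passages, speed_cps, stop_time):
    """Time well spent (seconds) for one simulated user on one ranked passage list.

    passages: ranked list of passages, each a list of (is_relevant, is_duplicate)
        flags, one per character, in reading order (a simplified stand-in for
        the HARD passage-level judgments and byte offsets).
    speed_cps: the user's reading speed in characters per second.
    stop_time: the user's stopping time in seconds.
    Reading a duplicated character is never counted as time well spent.
    """
    time_per_char = 1.0 / speed_cps
    elapsed = well_spent = 0.0
    for passage in passages:
        for is_relevant, is_duplicate in passage:
            if elapsed + time_per_char > stop_time:
                return well_spent
            elapsed += time_per_char
            if is_relevant and not is_duplicate:
                well_spent += time_per_char
    return well_spent

# Toy example: a 200-character relevant passage followed by a 300-character
# non-relevant passage, read at 25 characters per second for 15 seconds.
run = [[(True, False)] * 200, [(False, False)] * 300]
print(time_well_spent(run, speed_cps=25.0, stop_time=15.0))   # 8.0 seconds
```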

6. SINGLE TOPIC TIME WELL SPENT

To illustrate TWS, we examine simulation data produced for a single TREC 2004 HARD topic (HARD-424: “Bollywood”) from a single run (york04ha1). Following the procedure described in Section 4 and Section 5, we generated traces for m = 10,000 users and computed TWS over these traces. We analyzed these results to produce the six plots shown in Figure 3.

Traditional View. Figure 3a shows the cumulative gain as measured in number of relevant characters read vs. the total number of characters read. Each dot in the plot represents one of the 10,000 simulated users. The plot shows that for this run on this topic, there is initially a segment of relevant material and then there are some sections of non-relevant material. It is the non-relevant material that produces the plateaus in the cumulative gain. As is to be expected, only a few users read fast enough and long enough to reach deep into the stream of returned material. If we were to zoom in on this plot and only show the first several thousand characters read, we would see that gain increases character by character, but the differing scales of the x and y axes do not make this apparent in the plot.

While the data points in Figure 3a are the data points produced by our instantiation of time well spent, viewing the results in this manner does not reflect the way we have modeled user behavior. We have modeled users with different reading rates and different times spent reading. The correct horizontal axis for us to use is the amount of time spent reading, which is the meaningful measure of cost for the user experience captured by time well spent.

Time as Cost. Figure 3b shows the gain in relevant characters vs. time spent reading. In this plot we now see the variability produced by users with different reading rates and amounts of time spent reading. For a given amount of time spent reading, we have users obtaining many different amounts of gain. Some users read faster than others and thus users can read different amounts of material in the same time, which leads to different amounts of gain. While this plot is an improvement over the first, measuring gain in the number of relevant characters does not match the user experience modeled by time well spent. The correct measure of gain is the amount of time well spent, i.e., the amount of time reading relevant material.

Time as Cost and Gain. Figure 3c correctly shows us the user experience modeled by time well spent. As users invest more time in reading, more of their time can be well spent reading relevant material. The spread of points along the x-axis (time spent reading) comes directly from the distribution of time spent reading shown in Figure 2. The spread of points along the y-axis (time well spent) comes from the different reading rates of Figure 1 combined with the unique sequence of material returned by the retrieval system.

Figure 3: Detailed analysis of topic HARD-424 (“Bollywood”) for a TREC 2004 HARD Track run (york04ha1) from several different perspectives. In plots a–d, each point represents one of 10,000 simulated users. These simulated users read at different rates and for different amounts of time. Figure 3a shows how traditional effectiveness measures might view the evaluation: gain in relevant material vs. material consumed. Figure 3b shows gain in relevant material vs. time spent reading. In contrast, Figure 3c shows time well spent vs. time spent, and Figure 3d shows temporal precision vs. time spent. In these two figures, both of the axes are in units reflective of the user experience being evaluated and illustrate the experience of the user population. Figures 3e and 3f show the distribution of the dependent variables of Figures 3c and 3d, respectively. These plots are described in further detail in Section 6.

In addition to time well spent, another natural measure to compute is time well spent divided by the time spent reading, which we call temporal precision. Figure 3d shows the temporal precision vs. the amount of time spent reading.

Distribution of Gain. While it is interesting to view time well spent vs. time spent reading, we care about the distribution of time well spent for our simulated user population. Figure 3e shows a histogram of the time well spent for this system on this topic. Likewise, Figure 3f shows a histogram of temporal precision for this run and topic.

While space precludes the inclusion of the plot, the distribution of gain measured in number of relevant characters differs significantly from the distribution of gain measured as time well spent. If the user experience we are modeling is the amount of time well spent, we need to directly estimate the amount of time well spent.

The majority of cumulative gain effectiveness measures produce a single number for the performance of a system on a topic. For some measures, the single number is an expected value of gain given a distribution of hypothesized user behavior, e.g., a distribution of ranks reached given a model of user persistence. Rather than collapse the distribution of gain, we maintain it to enhance our ability to compare performance between systems. Our set of simulated users allows us to easily numerically approximate the resulting distribution of gain.

7. WHOLE RUN EVALUATION

In this section, we examine how to summarize performance across a set of topics and how to compare performance across a pair of systems. We take a user-focused approach to evaluation and estimate both user performance and the fraction of users that have a better experience given one system compared to another.

Our simulated users have different reading speeds and spend different amounts of time reading. These user differences will result in different amounts of time well spent for each user on each topic. It is at the level of each simulated user that we want to compute measures of effectiveness. Thus, for each user we compute that user's effectiveness across the set of topics. We are not interested in computing a per-topic summary of performance for each topic and then examining that distribution of topic scores. We want to understand how a given user performs across the topics in the test collection and produce a single distribution of user performance that summarizes the run.

How should we compute a user's performance across a set of topics? An obvious choice is to compute the mean time well spent. Other summary statistics are possible, such as the median. Unfortunately, the mean may not properly reflect the user's experience across a set of topics. For example, consider a user who has a time well spent of 30 seconds on all topics compared to one who has 60 seconds of time well spent on half the topics and 0 seconds well spent on the other half. Both users have the same average amount of time well spent, but it is possible that many users would prefer consistent performance across topics over highly variable performance. The mean is unable to distinguish between the two experiences. Nevertheless, we will use the mean to summarize time well spent across topics for a user, and we leave for future work the study of how best to summarize a user's performance across a set of topics.

7.1 Comparing runs

Information retrieval evaluation rarely involves reporting the performance of a single system. In most cases, we are concerned with comparing one system A to another system B. When effectiveness measures produce a single value for each topic, we examine the pairs of performance values for systems A and B at the topic level. Working with matched pairs allows us to better detect the significance of differences between systems because a large amount of the variability in the effectiveness measure is caused by the inherent differences between topics. By creating matched pairs, we control for topic variability and have an increased ability to detect changes in performance between the two systems. Creating matched pairs is also known as blocking our data; we form blocks where an independent variable has the same value across systems.

We produce a distribution of time well spent for each system on each topic. For each user that we have simulated, we have that user's performance measured on each system and topic. Thus, we block our data at the topic and user level. For each user, we have that user's performance on system A for topic i and match it with that user's performance on system B for topic i. As per our discussion above, we now need to compute a statistic that reflects the user's experience on system A vs. system B across the set of topics.

We use two different statistics to reflect user experience across topics. Our first statistic is the mean difference across the topics. Given a set of topics t = (t_1, t_2, ..., t_n), the mean difference in time well spent for a user i is:

\[ \mu_i = \frac{1}{|\mathbf{t}|} \sum_{t \in \mathbf{t}} \left( w_{Ati} - w_{Bti} \right), \tag{6} \]

where w_{Ati} is the time well spent on system A and topic t for user i, and w_{Bti} is the time well spent on system B for the same topic and user. Our second statistic is the fraction of topics for which the user prefers system A to system B, which we call the preference ratio. We say that a user prefers A to B if the user's amount of time well spent is greater on A than on B. If the user has the same amount of time well spent on both systems, then a tie occurs. We break ties evenly and divide one unit of preference in half and give it to both systems. The preference ratio for a user i is computed as:

\[ r_i = \frac{1}{|\mathbf{t}|} \sum_{t \in \mathbf{t}} \mathrm{Pref}\left( w_{Ati}, w_{Bti} \right), \tag{7} \]

where the Pref function is:

\[ \mathrm{Pref}(w_{Ati}, w_{Bti}) = \begin{cases} 1 & \text{if } w_{Ati} > w_{Bti}, \\ 0.5 & \text{if } w_{Ati} = w_{Bti}, \\ 0 & \text{if } w_{Ati} < w_{Bti}. \end{cases} \tag{8} \]

The preference ratio indicates the probability that a user will prefer system A to system B and does not suffer from the problem that the mean difference has with strong performance on some topics compensating for weak performance on others. Both statistics produce a distribution of values that reflects the user experience across the population with system A vs. system B. For example, the distribution of the preference ratios is r = (r_1, r_2, ..., r_m) for the m simulated users. From these distributions, we can then compute a final summary statistic such as the mean of the distribution, e.g., the mean average difference in time well spent and the mean preference ratio.
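Given per-user, per-topic time-well-spent values for two systems, both statistics are simple to compute. The sketch below (ours) assumes two arrays with one row per simulated user and one column per topic, and uses small made-up matrices as an example.

```python
import numpy as np

def mean_difference(w_A, w_B):
    """Equation 6: each user's mean difference in time well spent across topics."""
    return (w_A - w_B).mean(axis=1)

def preference_ratio(w_A, w_B):
    """Equations 7 and 8: each user's fraction of topics on which A beats B,
    with ties counted as 0.5."""
    pref = np.where(w_A > w_B, 1.0, np.where(w_A == w_B, 0.5, 0.0))
    return pref.mean(axis=1)

# Toy example: 3 simulated users (rows) on 4 topics (columns), in seconds.
w_A = np.array([[30.0, 12.0, 0.0, 45.0],
                [25.0, 10.0, 5.0, 40.0],
                [28.0, 11.0, 2.0, 44.0]])
w_B = np.array([[20.0, 12.0, 3.0, 50.0],
                [22.0,  9.0, 5.0, 35.0],
                [30.0, 11.0, 1.0, 40.0]])
print(mean_difference(w_A, w_B))    # distribution of mu_i over users
print(preference_ratio(w_A, w_B))   # distribution of r_i over users
```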

We use bootstrap sampling [15] to compute a confidence interval for this summary statistic that not only tells us the range of plausible values for it, but also whether or not the difference between the systems is statistically significant. To compute a bootstrap confidence interval, we resample the topics repeatedly and recompute the summary statistic. Each resampling of the topics and computation of the summary statistic contributes a sample to what is known as the bootstrap distribution. After collecting B bootstrap samples, we take the 100α percentile of the bootstrap distribution as the lower end, and the 100(1 − α) percentile as the upper end, of the 100(1 − 2α)% confidence interval. For example, if we want to compute a 95% confidence interval, α = (1 − 0.95)/2 = 0.025, and we use the 2.5th and the 97.5th percentiles of the bootstrap distribution as the lower and upper bounds of the interval. A 95% confidence interval tells us that with 0.95 probability the true value of the summary statistic falls within the interval. In the case of the mean average difference in time well spent, if the 95% confidence interval does not cross 0, then we can say that the difference is statistically significant at the p < 0.05 level.
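A sketch of this procedure (ours) for the mean average difference in time well spent: topics, the experimental units, are resampled with replacement, and the summary statistic is recomputed on each resample to form the bootstrap distribution.

```python
import numpy as np

def bootstrap_ci(per_user_topic_diff, B=10_000, alpha=0.025, seed=0):
    """Percentile bootstrap CI for the mean average difference in time well spent.

    per_user_topic_diff: array of shape (num_users, num_topics) holding
        w_A - w_B for each simulated user and topic. Topics are the sample;
        the simulated users only approximate the population distribution.
    Returns the (100*alpha, 100*(1 - alpha)) percentile interval.
    """
    rng = np.random.default_rng(seed)
    n_topics = per_user_topic_diff.shape[1]
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n_topics, size=n_topics)    # resample topics
        stats[b] = per_user_topic_diff[:, idx].mean()     # mean over users, topics
    return np.quantile(stats, [alpha, 1.0 - alpha])

# Example (using the w_A and w_B matrices from the previous sketch):
# lo, hi = bootstrap_ci(w_A - w_B)
# A 95% interval that excludes 0 indicates significance at p < 0.05.
```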

It is important to note that our testing of statistical significance does not depend on the number of simulated users. Our simulated users are used to estimate the user population's distribution of performance. Increasing the number of simulated users only increases the precision of our approximation of the distribution of time well spent. The sample size of our experiment remains the number of topics.

To illustrate, we take the york04ha1 run used previously in Section 6 and also the UWATexp2 run. Both of these runs provided passage results for the 25 passage topics of the TREC 2004 HARD Track. The mean time well spent for york04ha1 and UWATexp2 is 118 seconds and 82 seconds, respectively. Time well spent estimates that, on average, users will spend 36 seconds longer viewing relevant material with york04ha1 than with UWATexp2. We computed the bootstrap distribution with B = 10,000 samples and found that the 95% confidence interval for the mean average difference in time well spent between these two systems extends from 2.8 seconds to 71 seconds. Thus, the york04ha1 system has a statistically significant performance improvement in mean time well spent over UWATexp2 at the p < 0.05 level, but this performance improvement may be as small as 2.8 seconds, which is not of particular practical significance compared to the mean time well spent of 118 seconds. To tighten the confidence interval, we would need to increase the number of topics, i.e., increase the sample size.

Our simulated user population has a mean preference ratio of 0.65 for york04ha1 over UWATexp2. A preference ratio of 0.65 means that with 0.65 probability, a randomly selected user on a randomly selected topic will prefer york04ha1 to UWATexp2. A preference ratio of 0.5 would indicate no preference for either system. For these two systems, the 95% confidence interval for the mean preference ratio extends from 0.52 to 0.77, which is statistically significant at the p < 0.05 level, since the interval does not cross the 0.5 level. Similar to the mean average difference in time well spent, it is possible that the mean preference ratio advantage of york04ha1 is quite small, 0.52, and thus there may be little performance difference between the systems.

7.2 Comparing One Run to Many

Just as we compared york04ha1 to UWATexp2, we can compare york04ha1, UWATexp2, or any other run to all other HARD 2004 passage runs. When making a comparison, we designate the run of interest as system A and each of the other runs in turn as system B. When we find a positive difference in time well spent, system A is better than system B. When we find a preference ratio greater than 0.5, system A is better than system B.

The left-hand plot in Figure 4 shows the mean preference ratio vs. the mean average difference in time well spent for york04ha1 vs. the other passage runs. Each point represents the scores that york04ha1 receives when compared to another run. This plot shows that york04ha1 has a greater mean time well spent than all other runs, and that york04ha1 has a preference ratio greater than 0.5 compared to all other runs. Also shown in the plot are the 95% confidence intervals for each of these results. When a confidence interval crosses one of the dashed lines in the plot, that result is not statistically significant at the p < 0.05 level. As we can see, it isn't until there is around a 30 second difference in time well spent and a preference ratio of around 0.65 that runs start being significantly different from york04ha1. If we were to place york04ha1 on this plot, it would have a zero difference in time well spent from itself and a preference ratio of 0.5, i.e., it would be at the intersection of the dashed lines.

The right-hand plot in Figure 4 shows UWATexp2 vs. the other runs. As with the plot for york04ha1, if we were to place UWATexp2 on this plot, it would be located at the intersection of the dashed lines. The runs with negative mean average differences in time well spent are runs that deliver greater amounts of time well spent compared to UWATexp2. Only one of these runs is better by a statistically significant amount in both measures, and that run is york04ha1. The majority of the runs do not have statistically significant differences from UWATexp2.

8. CONCLUSION

In this paper, we created a new instantiation of time-biased gain designed to measure the effectiveness of retrieval systems where the value of a user's experience increases with the amount of time well spent. In doing so, we have produced an effectiveness measure where both cost and gain are in meaningful units. A user's cost is the amount of time spent searching, and a user's gain is the amount of time spent consuming relevant material. As an exemplar, we used the TREC 2004 HARD Track with search topics that had passage judgments. For each topic, we simulate a user population and produce the population's distribution of time well spent (TWS). Across multiple topics and between pairs of systems, we showed how to compute the distribution of the population's average paired difference in time well spent, as well as a preference ratio that tells us the probability that a randomly selected user and topic will prefer one system to another. Finally, we also showed how to use bootstrap sampling to compute confidence intervals on TWS and its associated statistics. Our methodology for computing confidence intervals is applicable not only to TWS, but also to other effectiveness measures that produce per-topic distributions of gain.

Figure 4: The TREC 2004 HARD Track runs york04ha1 (left) and UWATexp2 (right) compared to all other passage runs. Each point is one comparison, plotting the mean preference ratio against the mean average difference in time well spent (seconds). The bars are 95% confidence intervals (see Section 7.2).

9. ACKNOWLEDGMENTS

We thank Ellen Voorhees for her assistance in obtaining the TREC HARD Track data. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), in part by GRAND NCE, in part by Google, and in part by the University of Waterloo.

10. REFERENCES

[1] J. Allan. HARD track overview in TREC 2004: High accuracy retrieval from documents. In TREC, 2004.
[2] L. Azzopardi. Usage based effectiveness measures: Monitoring application performance in information retrieval. In CIKM, pp. 631–640, 2009.
[3] F. Baskaya, H. Keskustalo, and K. Järvelin. Time drives interaction: Simulating sessions in diverse searching environments. In SIGIR, pp. 105–114, 2012.
[4] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In SIGIR, pp. 25–32, 2004.
[5] B. Carterette. System effectiveness, user models, and user utility: A conceptual framework for investigation. In SIGIR, pp. 903–912, 2011.
[6] B. Carterette, E. Kanoulas, and E. Yilmaz. Simulating simple user behavior for system effectiveness evaluation. In CIKM, pp. 611–620, 2011.
[7] B. Carterette, E. Kanoulas, and E. Yilmaz. Incorporating variability in user behavior into systems based evaluation. In CIKM, pp. 135–144, 2012.
[8] B. Carterette. Multiple testing in statistical analysis of systems-based information retrieval experiments. TOIS, 30(1):4:1–4:34, March 2012.
[9] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM, pp. 621–630, 2009.
[10] C. L. A. Clarke, N. Craswell, I. Soboroff, and A. Ashkan. A comparative analysis of cascade measures for novelty and diversity. In WSDM, pp. 75–84, 2011.
[11] C. L. A. Clarke, L. Freund, M. D. Smucker, and E. Yilmaz. Report on the SIGIR 2013 workshop on modeling user behavior for information retrieval evaluation (MUBE 2013). SIGIR Forum, 47(2):84–95, January 2013.
[12] A. P. de Vries, G. Kazai, and M. Lalmas. Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit. In RIAO, 2004.
[13] G. Doherty, M. Massink, and G. Faconti. Reasoning about interactive systems with stochastic models. In Chris Johnson, editor, Interactive Systems: Design, Specification, and Verification, vol. 2220 of LNCS, pp. 144–163. 2001.
[14] G. Dupret and M. Lalmas. Absence time and user engagement: Evaluating ranking functions. In WSDM, pp. 173–182, 2013.
[15] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1998.
[16] J. Hewitt, C. Brett, and V. Peters. Scan rate: A new metric for the analysis of reading behaviors in asynchronous computer conferencing environments. Amer. J. of Distance Ed., 21(4):215–231, 2007.
[17] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. TOIS, 20(4):422–446, 2002.
[18] D. Kelly and C. R. Sugimoto. A systematic review of interactive information retrieval evaluation studies, 1967–2006. JASIST, 62(4):745–770, April 2013.
[19] M. Lalmas and A. Tombros. Evaluating XML retrieval effectiveness at INEX. SIGIR Forum, 41(1):40–57, June 2007.
[20] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. TOIS, 27(1):1–27, 2008.
[21] B. Piwowarski, A. Trotman, and M. Lalmas. Sound and complete relevance assessment for XML retrieval. TOIS, 27(1):1:1–1:37, December 2008.
[22] F. Radlinski and N. Craswell. Optimized interleaving for online retrieval evaluation. In WSDM, pp. 245–254, 2013.
[23] S. Robertson. On GMAP: And other transformations. In CIKM, pp. 78–83, 2006.
[24] S. Robertson. A new interpretation of average precision. In SIGIR, pp. 689–690, 2008.
[25] T. Sakai and Z. Dou. Summaries, ranked retrieval and sessions: A unified framework for information access evaluation. In SIGIR, pp. 473–482, 2013.
[26] M. D. Smucker, J. Allan, and B. Carterette. A comparison of statistical significance tests for information retrieval evaluation. In CIKM, pp. 623–632, 2007.
[27] M. D. Smucker and C. L. A. Clarke. Modeling user variance in time-biased gain. In HCIR, pp. 3:1–3:10, 2012.
[28] M. D. Smucker and C. L. A. Clarke. Stochastic simulation of time-biased gain. In CIKM, pp. 2040–2044, 2012.
[29] M. D. Smucker and C. L. A. Clarke. Time-based calibration of effectiveness measures. In SIGIR, pp. 95–104, 2012.
[30] M. D. Smucker and C. Jethani. Human performance and retrieval precision revisited. In SIGIR, pp. 595–602, 2010.
[31] C. Wade and J. Allan. Passage retrieval and evaluation. Technical Report IR-396, Center for Intelligent Information Retrieval (CIIR), Department of Computer Science, University of Massachusetts Amherst, 2005.