Planning, Running, and Analyzing Controlled Experiments on the Web

Ronny Kohavi, Microsoft

Slides available at http://exp-platform.com

Part 3 of

http://exp-platform.com/

Puzzling Outcomes

Wrap-up section based on the KDD 2012 paper co-authored by Ronny Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu
How does one determine the OEC for a search engine?
What are some of the most surprising results we faced, and how did we resolve them?

Puzzle 1: OEC for Search

An OEC is the Overall Evaluation Criterion
It is a metric (or set of metrics) that guides the org as to whether A is better than B in an A/B test
In prior work, we emphasized long-term focus and thinking about customer lifetime value, but operationalizing it is hard
Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals

Puzzle

A ranking bug in an experiment resulted in very poor search results
Distinct queries went up over 10%, and revenue went up over 30%
What metrics should be in the OEC for a search engine?

Puzzle 1 Explained

Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
Analyzing queries per month, we have

$$\frac{\text{Queries}}{\text{Month}} = \frac{\text{Queries}}{\text{Session}} \times \frac{\text{Sessions}}{\text{User}} \times \frac{\text{Users}}{\text{Month}}$$

where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)

Key observation: we want users to find answers and complete tasks quickly, so queries/session should be smaller
In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal
The OEC should therefore include the middle term: sessions/user
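The decomposition above can be sketched in a few lines of Python. This is only an illustration: the `VariantTotals` type, the `decompose` helper, and all the counts are hypothetical and not taken from the experiment. It shows how a roughly 10% rise in queries can come entirely from queries/session while sessions/user stays flat.

```python
from dataclasses import dataclass


@dataclass
class VariantTotals:
    """Raw monthly counts for one experiment variant (hypothetical numbers)."""
    queries: int
    sessions: int
    users: int


def decompose(totals: VariantTotals) -> dict:
    """Split queries/month into queries/session, sessions/user, and users."""
    return {
        "queries_per_session": totals.queries / totals.sessions,
        "sessions_per_user": totals.sessions / totals.users,
        "users": totals.users,
    }


# Made-up counts: the degraded-ranking treatment issues 10% more queries,
# but users do not come back for more sessions.
control = VariantTotals(queries=9000, sessions=3000, users=1000)
treatment = VariantTotals(queries=9900, sessions=3000, users=1000)

control_parts = decompose(control)
treatment_parts = decompose(treatment)

for name, parts in (("control", control_parts), ("treatment", treatment_parts)):
    # The product of the three terms recovers the total query count.
    product = (parts["queries_per_session"]
               * parts["sessions_per_user"]
               * parts["users"])
    print(name, parts, "product:", round(product))
```

In this made-up case, queries/session rises while sessions/user is identical across variants, so an OEC built on sessions/user would not reward the degraded results.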

Puzzle 2: Click Tracking

A piece of code was added such that when a user clicked on a search result, additional JavaScript was executed (a session cookie was updated with the destination) before navigating to the destination page
This slowed down the user experience slightly, so we expected a slightly negative experiment
Results showed that users were clicking more!

Why?

Puzzle 2: Click Tracking - Background

User clicks (and form submits) are instrumented and form the basis for many metrics
Instrumentation is typically done by having the web browser request a web beacon (1x1 pixel image)

Classical tradeoff here

Waiting for the beacon to return slows the action (typically navigating away)
Making the call asynchronous is known to cause click-loss, as browsers can kill the request (a classical browser optimization, because the result can’t possibly matter for the new page)

Small delays, on-mouse-down, or redirect are used

Puzzle 2: Click Tracking Explained

Click-loss varies dramatically by browser
Chrome, Firefox, and Safari are aggressive at terminating such requests. Safari’s click-loss is over 50%.
IE respects image requests for backward compatibility reasons
White paper available on this issue: http://www.exp-platform.com/Pages/TrackingClicksSubmits.aspx

Other cases where this impacts experiments

Opening a link in a new tab/window will overestimate the click delta
Because the main window remains open, browsers can’t optimize and kill the beacon request, so there is less click-loss
Using HTML5 to update components of the page instead of refreshing the whole page has the same overestimation problem
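One reading consistent with these slides is that the extra JavaScript delayed navigation just enough for more beacon requests to complete, so fewer clicks were lost in the treatment. The sketch below illustrates how a difference in click-loss alone can look like a lift. The function names and the loss rates are assumptions chosen for illustration, not measured values.

```python
def observed_clicks(true_clicks: float, click_loss_rate: float) -> float:
    """Clicks that reach the logs after a fraction of beacons is dropped."""
    return true_clicks * (1.0 - click_loss_rate)


def apparent_lift(true_control: float, true_treatment: float,
                  loss_control: float, loss_treatment: float) -> float:
    """Relative change in *logged* clicks, treatment versus control."""
    logged_control = observed_clicks(true_control, loss_control)
    logged_treatment = observed_clicks(true_treatment, loss_treatment)
    return logged_treatment / logged_control - 1.0


# Hypothetical scenario: users click exactly as often in both variants,
# but the treatment's small delay cuts click-loss from 10% to 5%.
lift = apparent_lift(true_control=10_000, true_treatment=10_000,
                     loss_control=0.10, loss_treatment=0.05)
print(f"apparent lift in logged clicks: {lift:.1%}")  # about 5.6%
```

With identical true behavior, the logged metric still moves, which is why the instrumentation has to be understood before the click delta is trusted.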

Background: Primacy and Novelty Effects

Primacy effect occurs when you change the navigation on a web site

Experienced users may be less efficient until they get used to the new navigation
Control has a short-term advantage

Novelty effect happens when a new design is introduced

Users investigate the new feature, click everywhere, and introduce a “novelty” bias that dies quickly if the feature is not truly useful
Treatments have a short-term advantage

Puzzle 3: Effects Trend

Given the high failure rate of ideas, new experiments are followed closely to determine if the new idea is a winner
Multiple graphs of effect look like this

Negative on day 1: -0.55%
Less negative on day 2: -0.38%
Less negative on day 3: -0.21%
Less negative on day 4: -0.13%

The experimenter extrapolates linearly and says: primacy effect. This will be positive in a couple of days, right?
Wrong! This is expected

[Figure: Cumulative Effect by date, 8/30/2011 to 9/4/2011]

Puzzle 3: Effects Trend

For many metrics, the standard deviation of the mean is proportional to $1/\sqrt{n}$, where $n$ is the number of users
As we run an experiment longer, more users are admitted into the experiment, so $n$ grows and the confidence interval shrinks
The first days are highly variable
The first day has a 67% chance of falling outside the 95% CI at the end of the experiment
The second day has a 55% chance of falling outside this bound

[Figure: Effect vs. Experiment Days (0 to 20), showing the 95% bound and the 21-day bound]
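The 67% and 55% figures can be reproduced with a short calculation. The sketch below rests on assumptions not stated on the slide: the true effect is zero, the cumulative estimate is normally distributed, the same number of users arrives each day (so $n$ grows linearly with days), and the experiment runs 21 days as the chart legend suggests. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_outside_final_bound(day: int, total_days: int,
                             confidence: float = 0.95) -> float:
    """Chance that the cumulative effect through `day` lies outside the
    confidence bound that applies at the end of `total_days`.

    If one day's estimate has standard deviation s, the estimate through
    `day` has standard deviation s / sqrt(day), and the final bound is
    +/- z * s / sqrt(total_days). The ratio of the two gives the threshold
    in standard-normal units: z * sqrt(day / total_days).
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + confidence / 2.0)
    threshold = z * sqrt(day / total_days)
    return 2.0 * (1.0 - std_normal.cdf(threshold))


for day in (1, 2, 7, 21):
    p = prob_outside_final_bound(day, total_days=21)
    print(f"day {day:2d}: {p:.0%} chance of falling outside the 21-day bound")
```

Under these assumptions the first two days come out at roughly 67% and 55%, matching the slide, and the probability falls to 5% only on the final day.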

Puzzle 3: Effects Trend

The longer graph

This was an A/A test, so the true effect is 0

[Figure: Cumulative Effect over the longer run]

Puzzle 4: Statistical Power

We expect the standard deviation of the mean (and thus the confidence interval) to be proportional to $1/\sqrt{n}$, where $n$ is the number of users
So as the experiment runs longer and more users are admitted, the confidence interval should shrink
Here is the graph for sessions/user

[Figure: Confidence Interval Width (percent) vs. Size of Treatment (relative factor, 0.5 to 20), with three lines: 1 week, 2 weeks, 3 weeks]

Overlapping lines? That’s the problem!

Puzzle 4: Statistical Power

The distribution changes

Users churn, so they contribute zero visits
New users join with a fresh count of one
We have a mixture
Empirically, the coefficient of variation (ratio of the standard deviation to the mean) grows at the same rate as $\sqrt{n}$

Running an experiment longer does not increase statistical power for some metrics; you must increase the variant size
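A small sketch of why the lines overlap, under simplifying assumptions: the relative half-width of a 95% confidence interval for a mean is about $1.96 \cdot CV / \sqrt{n}$, users accumulate linearly with weeks, and the coefficient of variation grows with the square root of the run length. The function names and starting values are hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% normal quantile


def relative_ci_half_width(cv: float, n_users: float) -> float:
    """Half-width of the CI as a fraction of the mean: z * CV / sqrt(n)."""
    return Z_95 * cv / sqrt(n_users)


def half_width_after(weeks: float, cv_week1: float, users_week1: float,
                     size_factor: float = 1.0) -> float:
    """CI half-width for a metric whose CV grows like sqrt(run length).

    Assumptions (illustrative only): users accumulate linearly with weeks
    and with the variant's size factor, while the CV depends only on how
    long the experiment has been running.
    """
    n_users = users_week1 * weeks * size_factor
    cv = cv_week1 * sqrt(weeks)
    return relative_ci_half_width(cv, n_users)


one_week = half_width_after(weeks=1, cv_week1=2.0, users_week1=10_000)
three_weeks = half_width_after(weeks=3, cv_week1=2.0, users_week1=10_000)
four_times_size = half_width_after(weeks=1, cv_week1=2.0, users_week1=10_000,
                                   size_factor=4.0)

print(f"1 week:           {one_week:.4f}")
print(f"3 weeks:          {three_weeks:.4f}")
print(f"1 week, 4x size:  {four_times_size:.4f}")
```

Under these assumptions, running longer leaves the width unchanged because the growth in the coefficient of variation cancels the growth in $n$, while a larger variant adds users without lengthening the run and so does narrow the interval.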

Puzzle 5: Carryover Effects

Experiment is run, results are surprising. (This by itself is fine, as our intuition is poor.)
Rerun the experiment, and the effects disappear
Reason: the bucket system recycles users, and the prior experiment had carryover effects
These can last for months!
Must run A/A tests, or re-randomize
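One common way to re-randomize is to hash the user ID together with a per-experiment seed, so that changing the seed reshuffles users across buckets. The sketch below is a generic illustration, not the bucket system of any particular search engine; `assign_bucket`, the seed strings, and the bucket count are assumptions.

```python
import hashlib


def assign_bucket(user_id: str, seed: str, num_buckets: int = 100) -> int:
    """Deterministically map a user to a bucket for a given seed."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets


users = [f"user-{i}" for i in range(10_000)]

# Buckets under the seed used by the earlier experiment (hypothetical).
old_buckets = [assign_bucket(u, seed="experiment-A") for u in users]
# Buckets after re-randomizing with a fresh seed (hypothetical).
new_buckets = [assign_bucket(u, seed="experiment-B") for u in users]

# After rehashing, only about 1 / num_buckets of users should keep their
# old bucket, so users exposed to the earlier treatment are spread evenly
# over the new variants.
same = sum(old == new for old, new in zip(old_buckets, new_buckets))
print(f"users who kept their bucket: {same / len(users):.2%}")
```

An A/A test on the new assignment is still worth running, since rehashing spreads any carryover evenly but does not prove that the buckets are balanced.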

Summary

OEC: evaluate long-term goals through short-term metrics
The difference between theory and practice is greater in practice than in theory

Instrumentation issues (e.g., click-tracking) must be understood
Carryover effects impact the “bucket systems” used by Bing, Google, and Yahoo; they require rehashing and A/A tests

Experimentation insight:
Effect trends are expected
Longer experiments do not increase power for some metrics. Fortunately, we have a lot of users
