Sponsored By:

© 2009, Microsoft Corporation


Top 7 Testing Pitfalls
Presented live November 18, 2009

Featuring Guest Star: Ronny Kohavi, GM, Microsoft Experimentation Platform

Admin Note: Attendees will also get a copy of these slides plus an on-demand MP3 of this session via email on Thursday afternoon, November 19th


First: Why Bother Testing?

-> ‘Best Practices’, standard Web design templates, and marketer’s “gut” often FAIL tests.

-> For previously untested sites, testing gives an average ~40% conversion lift.

-> Tests can help you generate better quality leads or sales – not just more conversions.

WhichTestWon.com


Agenda
• Intro & controlled experiments in one slide
• Examples: you’re the decision maker
• Seven pitfalls
• Q&A

Pitfalls based on KDD 2009 paper: http://exp-platform.com/ExPpitfalls.aspx by Thomas Crook, Brian Frasca, Ronny Kohavi, and Roger Longbotham



Our Experience at Microsoft
• The Experimentation Platform started at Microsoft in 2006
• Experiments ran on 20 Microsoft properties, including MSN home pages in several countries, MSN Money, MSN Real Estate, www.microsoft.com, store.microsoft.com, support.microsoft.com, Office Online, www.xbox.com, several marketing sites, and Windows Genuine Advantage
• Large experiments run with tens of millions of users
• Multiple experiments have projected annual improvements of over $1M each


Controlled Experiments in One Slide

• Concept is trivial
  – Randomly split traffic between two (or more) versions
    • A (Control)
    • B (Treatment)
  – Collect metrics of interest
  – Analyze

[Diagram: 100% of users are randomly split into 50% Control (existing system) and 50% Treatment (existing system with Feature X). User interactions are instrumented, analyzed & compared; analysis happens at the end of the experiment.]

Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
Must run statistical tests to confirm differences are not due to chance
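As a minimal sketch of the "run statistical tests" step, the following compares the conversion rates of Control and Treatment with a two-sided two-proportion z-test. The function name and the counts are hypothetical, and the sketch assumes each user is counted once as converted or not converted; it is one common test, not necessarily the one used by the Experimentation Platform.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, not taken from any experiment in this deck.
z, p = two_proportion_z_test(conv_a=1000, n_a=50000, conv_b=1100, n_b=50000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A p-value below 0.05 would be read as a statistically significant difference at the 95% confidence level.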


Examples
• Three experiments that ran at Microsoft
• All had enough users for statistical validity
• OEC: the Overall Evaluation Criterion
• See how many you get right
  – Three choices are:
    • A wins (the difference is statistically significant)
    • A and B are approximately the same (no stat sig diff)
    • B wins



Office Online
Test new design for Office Online homepage

OEC: Clicks on revenue generating links (red below)

[Screenshots: homepage design A and homepage design B]

Is A better, B better, or are they about the same?


Office Online
• B was 64% worse
• The Office Online team wrote

  A/B testing is a fundamental and critical Web services… consistent use of A/B testing could save the company millions of dollars



MSN UK Hotmail experiment
Hotmail module on the MSN UK home page


MSN UK Hotmail experiment
A: When user clicks on email, Hotmail opens in same window
B: Open Hotmail in separate window
Trigger: only users that click in the module are in experiment (no diff otherwise)

OEC: clicks on home page (after trigger)



UK Hotmail
• For those in the experiment, clicks on MSN Home Page increased +8.9%
• <0.001% of users in B wrote negative feedback about the new window



Data Trumps Intuition
We distribute experiment reports widely at Microsoft
Someone who saw the report wrote

This report came along at a really good time and was VERY useful.

I argued this point to my team (open Live services in new window from HP) just some days ago. They all turned me down.

Funny, now they have all changed their minds.



MSN Home Page Search Box
OEC: Clickthrough rate for Search box and popular searches

[Screenshots: search box variants A and B]

Differences:
• A has taller search box (overall size is the same), has magnifying glass icon, “popular searches”
• B has big search button



Search Box
• No statistically significant difference
• Insight

  Stop debating, it’s easier to get the data



Hard to Assess the Value of Ideas: Data Trumps Intuition

• At Amazon, half of the experiments failed to show improvement
• QualPro tested 150,000 ideas over 22 years
  – 75 percent of important business decisions and business improvement ideas either have no impact on performance or actually hurt performance…
• Based on experiments with ExP at Microsoft
  – 1/3 of ideas were positive ideas and statistically significant
  – 1/3 of ideas were flat: no statistically significant difference
  – 1/3 of ideas were negative and statistically significant
• Our intuition is poor: 2/3 of ideas do not improve the metric(s) they were designed to improve. Humbling!



The HiPPO

• Our opinions are often wrong – get the data
• HiPPO stands for the Highest Paid Person’s Opinion
• Hippos kill more humans than any other (non-human) mammal (really)
• Don’t let HiPPOs in your org kill innovative ideas. ExPeriment!
• We give out these toy HiPPOs at Microsoft


The less data, the stronger the opinions


Is Software Just Hard? NO!
• Doctors have been taking the HiPPocratic Oath and promising “no harm,” yet many beliefs were wrong for hundreds of years
• For centuries, an illness was thought to be a toxin
• Opening a vein and letting the sickness run out was considered the best solution – bloodletting
• One British medical text recommended bloodletting for acne, asthma, cancer, cholera, coma, convulsions, diabetes, epilepsy, gangrene, gout, herpes, indigestion, insanity, jaundice, leprosy, ophthalmia, plague, pneumonia, scurvy, smallpox, stroke, tetanus, tuberculosis, and for some one hundred other diseases
• Physicians often reported the simultaneous use of fifty or more leeches on a given patient. Through the 1830s the French imported about forty million leeches a year for medical purposes


Bloodletting (2 of 2)
• President George Washington had a sore throat and doctors extracted 82 ounces of blood over 10 hours (35% of his total blood), causing anemia and hypotension. He died that night
• Pierre Louis did an experiment in 1836 that is now recognized as one of the first clinical trials, or randomized controlled experiments. He treated people with pneumonia either with
  – early, aggressive bloodletting, or
  – less aggressive measures
• At the end of the experiment, Dr. Louis counted the bodies. They were stacked higher over by the bloodletting sink

Lancet


Agenda
• Intro & controlled experiments in one slide
• Examples: you’re the decision maker
• Seven pitfalls
• Q&A



Pitfall 1: Wrong Success Metric
Remember this example?

OEC: Clicks on revenue generating links (red below)

[Screenshots: homepage design A and homepage design B]


Pitfall 1: Wrong OEC
• B had a drop in the OEC of 64%
• Were sales correspondingly less also?
• No. The experiment is valid only if the conversion from a click to a purchase is similar in A and B
• The price was shown only in B, sending more qualified purchasers to the pipeline
• Lesson: measure what you really need to measure, even if it’s difficult!


Pitfall 2: Incorrect Interval Calculation
• Confidence Intervals (CI) are a great way to summarize results that have variability
• Example: 95% CI for conversion rate might be 2.8%-3.2% (mean of 3.0% +/- 0.2%), which improved from 1.8%-2.2%
• Business users prefer percent effect: 2% to 3% is a 50% improvement in conversion!
• How can we provide a confidence interval on the 50%?


Pitfall 2: Incorrect Interval Calculation (cont)

• You can’t just convert the confidence interval to a percent effect because the denominator is a random variable (we have a ratio of means)
• Use Fieller’s formula for an exact confidence interval on the percent effect
  – More complex formula, but that’s why we have computers (and statisticians who figured this out in 1954)
  – Note: the confidence interval is not always symmetric around the mean in this case
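A minimal sketch of a Fieller-style interval for the percent effect, applied to the example from the previous slide. It assumes the Treatment and Control groups are independent and that their sample means are approximately normal; the function name is hypothetical, and the standard errors are backed out of the +/- 0.2% half-widths of the 95% intervals.

```python
import math

def fieller_percent_effect_ci(mean_t, se_t, mean_c, se_c, z=1.96):
    """Fieller confidence interval for (mean_t / mean_c - 1).

    Assumes independent groups; se_t and se_c are standard errors of the means.
    """
    # The interval is the set of ratios r with (mean_t - r * mean_c)^2
    # <= z^2 * (se_t^2 + r^2 * se_c^2), which is a quadratic in r.
    a = mean_c ** 2 - (z * se_c) ** 2
    if a <= 0:
        raise ValueError("Control mean is too close to zero; interval is unbounded")
    b = mean_t * mean_c
    c = mean_t ** 2 - (z * se_t) ** 2
    root = math.sqrt(b ** 2 - a * c)
    return (b - root) / a - 1, (b + root) / a - 1

# Example from the slide: Control 2.0% +/- 0.2%, Treatment 3.0% +/- 0.2% (95% CIs).
se = 0.2 / 1.96
low, high = fieller_percent_effect_ci(mean_t=3.0, se_t=se, mean_c=2.0, se_c=se)
print(f"percent effect: +50%, 95% CI: +{low:.1%} to +{high:.1%}")
```

Under these assumptions the interval runs from about +33% to about +70%, so it extends further above the 50% point estimate than below it.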


Pitfall 3: Using Standard Formulas for Standard Deviation

• Many metrics for online experiments cannot use the standard statistical formulas
• Example: Click-through rate = clicks/page-views
• The standard statistical approach would assume this would be approximately Bernoulli
• However, the true standard deviation is commonly larger than Bernoulli because of independence violations (page views from the same user are correlated)
• Solution: Bootstrap or the delta method
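A minimal sketch of the delta method for click-through rate, compared with the naive Bernoulli formula. It assumes users (not page views) are the independent units and takes per-user click and page-view counts as input. The function name and the simulated data are hypothetical; the simulation gives each user a personal click propensity to create the kind of correlation described above.

```python
import math
import random

def ctr_standard_errors(clicks, page_views):
    """Return (ctr, naive_se, delta_se) from per-user click and page-view counts."""
    n = len(clicks)
    mean_x = sum(clicks) / n
    mean_y = sum(page_views) / n
    ctr = mean_x / mean_y
    # Naive formula: treats every page view as an independent Bernoulli trial.
    naive_se = math.sqrt(ctr * (1 - ctr) / sum(page_views))
    # Delta method for a ratio of means, with users as the independent units.
    var_x = sum((x - mean_x) ** 2 for x in clicks) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in page_views) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(clicks, page_views)) / (n - 1)
    delta_var = (var_x - 2 * ctr * cov_xy + ctr ** 2 * var_y) / (n * mean_y ** 2)
    return ctr, naive_se, math.sqrt(delta_var)

# Simulated users (made-up data): each has a personal click propensity.
rng = random.Random(0)
clicks, page_views = [], []
for _ in range(5000):
    views = rng.randint(1, 20)
    propensity = rng.choice([0.01, 0.19])
    page_views.append(views)
    clicks.append(sum(rng.random() < propensity for _ in range(views)))

ctr, naive_se, delta_se = ctr_standard_errors(clicks, page_views)
print(f"CTR = {ctr:.4f}, naive SE = {naive_se:.5f}, delta-method SE = {delta_se:.5f}")
```

With data like this, the delta-method standard error comes out larger than the naive one, which is the direction the slide warns about.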


Best Practice: Ramp-up
• Ramp-up
  – Start an experiment at 0.1%
  – Do simple analyses to make sure no egregious problems can be detected
  – Ramp-up to a larger percentage, and repeat until desired percent (e.g., 50%)
• Big differences are easy to detect because the min sample size is proportional to the inverse square of the effect we want to detect
  – Detecting 10% difference requires a small sample and serious problems can be detected during ramp-up
  – Detecting 0.1% requires a population 100^2 = 10,000 times bigger
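The quadratic relationship can be sketched with a common rule of thumb, n = 16 * sigma^2 / delta^2 users per variant (roughly 80% power at a 5% significance level). The function name and the 5% baseline conversion rate are assumptions chosen for illustration.

```python
def users_per_variant(baseline_rate, relative_effect):
    """Approximate users needed per variant: n = 16 * sigma^2 / delta^2."""
    variance = baseline_rate * (1 - baseline_rate)  # variance of a 0/1 conversion
    delta = baseline_rate * relative_effect  # absolute difference to detect
    return 16 * variance / delta ** 2

# Hypothetical 5% baseline conversion rate.
n_large = users_per_variant(0.05, 0.10)   # detect a 10% relative change
n_small = users_per_variant(0.05, 0.001)  # detect a 0.1% relative change
print(f"10% effect:  {n_large:,.0f} users per variant")
print(f"0.1% effect: {n_small:,.0f} users per variant")
print(f"ratio: {n_small / n_large:,.0f}x")
```

Shrinking the detectable effect by a factor of 100 multiplies the required sample by 100^2, whatever the baseline rate.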



Pitfall 4: Combining Data when Percent to Treatment Varies

• Simplified example: 1,000,000 users per day
• For each individual day the Treatment is much better
• However, cumulative result for Treatment is worse
• This is called Simpson’s Paradox

Conversion Rate for two days

            Friday                    Saturday                  Total
            (C/T split: 99/1)         (C/T split: 50/50)
Control     20,000/990,000 = 2.02%    5,000/500,000 = 1.00%     25,000/1,490,000 = 1.68%
Treatment   230/10,000 = 2.30%        6,000/500,000 = 1.20%     6,230/510,000 = 1.22%
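The arithmetic behind this slide can be checked with a short sketch that uses the conversion and user counts from the two-day example. The variable and function names are illustrative only.

```python
# (conversions, users) per day, from the Pitfall 4 example.
control = {"Friday": (20_000, 990_000), "Saturday": (5_000, 500_000)}
treatment = {"Friday": (230, 10_000), "Saturday": (6_000, 500_000)}

def rate(conversions, users):
    return conversions / users

def pooled_rate(days):
    """Conversion rate after simply adding up all days."""
    total_conversions = sum(c for c, _ in days.values())
    total_users = sum(u for _, u in days.values())
    return total_conversions / total_users

for day in control:
    print(f"{day}: Control {rate(*control[day]):.2%}, "
          f"Treatment {rate(*treatment[day]):.2%}")
print(f"Total: Control {pooled_rate(control):.2%}, "
      f"Treatment {pooled_rate(treatment):.2%}")
```

Treatment wins on each day, yet loses in the pooled total, because most Treatment users come from Saturday, the day with the lower conversion rate for everyone.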


Pitfall 5: Not Filtering out Robots
• Internet sites can get a significant amount of robot traffic (search engine crawlers, email harvesters, botnets, etc.)
• Robots can cause misleading results
  – Most concerned about robots with high traffic (e.g. clicks or PVs) that stay in Treatment or Control
  – We’ve seen one robot with > 600,000 clicks in a month on one page (and it was executing JavaScript)


Pitfall 5: Not Filtering out Robots (cont)
• Identifying robots can be difficult
  – Some robots identify themselves through the UserAgent
  – Many look like human users and execute JavaScript
  – Use heuristics to ID and remove robots from analysis (e.g. more than 100 clicks in an hour)
  – Ongoing research. No silver bullet
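A minimal sketch of the "more than 100 clicks in an hour" heuristic. The event format (user id, Unix timestamp in seconds) and the function names are assumptions for illustration, and the sketch counts clicks per clock hour rather than over a sliding window, which is a simplification.

```python
from collections import Counter

def find_suspected_robots(click_events, max_clicks_per_hour=100):
    """Return user ids with more than max_clicks_per_hour clicks in any clock hour.

    click_events is a list of (user_id, unix_timestamp_seconds) pairs.
    """
    clicks_per_user_hour = Counter(
        (user_id, int(timestamp // 3600)) for user_id, timestamp in click_events
    )
    return {
        user_id
        for (user_id, _hour), count in clicks_per_user_hour.items()
        if count > max_clicks_per_hour
    }

def remove_robots(click_events, robots):
    """Drop all events from suspected robots before analysis."""
    return [(user_id, ts) for user_id, ts in click_events if user_id not in robots]

# Made-up events: one "user" clicking every second, one clicking every 10 minutes.
events = [("u1", second) for second in range(150)]
events += [("u2", minute * 600) for minute in range(20)]
robots = find_suspected_robots(events)
clean_events = remove_robots(events, robots)
print(robots, len(clean_events))
```

A threshold like this only catches high-volume robots; as the slide notes, there is no silver bullet.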


Effect of Robots on A/A Experiment
• Each hour represents clicks from thousands of users
• The “spikes” can be traced to single “users” (robots)


Pitfall 6: Invalid or Inadequate Instrumentation
• Validating initial instrumentation
  – Logging audit – compare experimentation observations with the system of record
  – A/A experiment – run a “mock” experiment where users are randomly assigned to two groups but users get Control in both
    • Expect about 5% of metrics to be statistically significant (at the 95% confidence level)
    • P-values should be uniformly distributed on the interval (0,1) and no p-values should be very close to zero (e.g. <0.001)
  – Many of our “customers” initially fail one of these tests
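The A/A expectations can be illustrated with a small simulation: run many mock experiments where both groups get Control, and look at the share of p-values below 0.05 and at how the p-values spread over (0,1). The conversion rate, group size, number of runs, and function name are made up for illustration; a real A/A test uses live traffic, not simulated users.

```python
import math
import random

def aa_p_value(rng, users_per_group=1000, rate=0.1):
    """Simulate one A/A test and return its two-sided p-value."""
    # Both groups get Control, so both convert at the same true rate.
    conv_a = sum(rng.random() < rate for _ in range(users_per_group))
    conv_b = sum(rng.random() < rate for _ in range(users_per_group))
    pooled = (conv_a + conv_b) / (2 * users_per_group)
    se = math.sqrt(pooled * (1 - pooled) * 2 / users_per_group)
    z = (conv_a - conv_b) / users_per_group / se
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(42)  # fixed seed so the run is repeatable
p_values = [aa_p_value(rng) for _ in range(1000)]

share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share of A/A tests with p < 0.05: {share_significant:.3f}")

# Rough uniformity check: each tenth of (0, 1) should hold about 10% of p-values.
buckets = [0] * 10
for p in p_values:
    buckets[min(int(p * 10), 9)] += 1
print(buckets)
```

If a real system shows far more than about 5% significant results in A/A tests, or p-values piled up near zero, the instrumentation or the analysis is suspect.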


Pitfall 7: Insufficient Experimental Control
• Must make sure the only difference between Treatment and Control is the change being tested
• Plot shows hourly click-through rate for Control and Treatment in the MSN Home Page
• Headlines were supposed to be the same in both
• One headline was different for one 7 hour period, significantly changing the result


Summary
1. It is hard to assess the value of ideas
  – Get the data by experimenting because data trumps intuition
  – Examples are humbling
  – Avinash Kaushik wrote: “…the power of: Controlled Experiments. I am convinced this is God’s gift to online humanity.”
2. Replace the HiPPO with an OEC
  – Make sure the org agrees what you are optimizing (long term lifetime value)
  – Experts are often wrong. Doctors did bloodletting for centuries (and they swear by the HiPPOcratic oath)
3. Watch out for the pitfalls



Resources for Deeper Dive
• Controlled Experiments on the Web: Survey and Practical Guide, in the Data Mining and Knowledge Discovery journal, 2009
  http://exp-platform.com/hippo_long.aspx
• KDD 2009 Tutorial
  http://exp-platform.com/tutorial.aspx
• Contact: ronnyk@ microsoft dot you know what


Live Q&A with Anne, Ronny, Roger

WhichTestWon.com


Thanks, plus 2 free offers:

Online Testing Awards
• Free entries
• Everyone eligible
• Deadline this Friday!
http://whichtestwon.com/awards

Free Landing Page Evaluation Offer
Click to schedule: http://whichtestwon.com/widerfunnel/lp.html
