Sponsored By:
© 2009, Microsoft Corporation
Top 7 Testing Pitfalls
Presented live November 18, 2009
Featuring Guest Star: Ronny Kohavi, GM, Microsoft Experimentation Platform
Admin Note: Attendees will also get a copy of these slides + an on-demand MP3 of this via email on Thursday afternoon, November 19th
First: Why Bother Testing?
-> ‘Best Practices’, standard Web design templates, and marketers’ “gut” often FAIL tests.
-> For previously untested sites, testing gives an average ~40% conversion lift.
-> Tests can help you generate better quality leads or sales – not just more conversions.
WhichTestWon.com
Agenda
• Intro & controlled experiments in one slide
• Examples: you’re the decision maker
• Seven pitfalls
• Q&A
Pitfalls based on KDD 2009 paper: http://exp-platform.com/ExPpitfalls.aspx by Thomas Crook, Brian Frasca, Ronny Kohavi, and Roger Longbotham
Our Experience at Microsoft
• The Experimentation Platform started at Microsoft in 2006
• Experiments ran on 20 Microsoft properties, including MSN home pages in several countries, MSN Money, MSN Real Estate, www.microsoft.com, store.microsoft.com, support.microsoft.com, Office Online, www.xbox.com, several marketing sites, and Windows Genuine Advantage
• Large experiments run with tens of millions of users
• Multiple experiments have projected annual improvements of over $1M each
Controlled Experiments in One Slide
• Concept is trivial
  – Randomly split traffic between two (or more) versions
    • A (Control)
    • B (Treatment)
  – Collect metrics of interest
  – Analyze
[Diagram: 100% of users are split 50%/50% between Control (existing system) and Treatment (existing system with Feature X). User interactions are instrumented, analyzed & compared. Analyze at the end of the experiment.]
Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
Must run statistical tests to confirm differences are not due to chance
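The "run statistical tests" step can be sketched as follows. This is a minimal example assuming a two-proportion z-test on conversion counts, which is one common choice; the slides do not say which test the Experimentation Platform uses. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for a difference between two conversion rates."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Pooled rate under the null hypothesis that Control and Treatment are equal
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical 50/50 split: 2.0% conversion in Control, 2.2% in Treatment
z, p_value = two_proportion_z_test(conv_c=1000, n_c=50000, conv_t=1100, n_t=50000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A p-value below 0.05 here would be read as "the difference is unlikely to be due to chance" at the 95% confidence level.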
Examples
• Three experiments that ran at Microsoft
• All had enough users for statistical validity
• OEC: the Overall Evaluation Criterion
• See how many you get right
  – Three choices are:
    • A wins (the difference is statistically significant)
    • A and B are approximately the same (no stat sig diff)
    • B wins
Office Online
Test new design for Office Online homepage
OEC: Clicks on revenue generating links (red below)
[Screenshots: design A and design B]
Is A better, B better, or are they about the same?
Office Online
• B was 64% worse
• The Office Online team wrote:
  A/B testing is a fundamental and critical Web services… consistent use of A/B testing could save the company millions of dollars
MSN UK Hotmail experiment
Hotmail module on the MSN UK home page
MSN UK Hotmail experiment
A: When user clicks on email, Hotmail opens in same window
B: Open Hotmail in separate window
Trigger: only users that click in the module are in experiment (no diff otherwise)
OEC: clicks on home page (after trigger)
Is A better, B better, or are they about the same?
UK Hotmail
• For those in the experiment, clicks on MSN Home Page increased +8.9%
• <0.001% of users in B wrote negative feedback about the new window
Data Trumps Intuition
We distribute experiment reports widely at Microsoft
Someone who saw the report wrote:
This report came along at a really good time and was VERY useful.
I argued this point to my team (open Live services in new window from HP) just some days ago. They all turned me down.
Funny, now they have all changed their minds.
MSN Home Page Search Box
OEC: Clickthrough rate for Search box and popular searches
[Screenshots: search box A and search box B]
Differences:
• A has taller search box (overall size is the same), has magnifying glass icon, “popular searches”
• B has big search button
Is A better, B better, or are they about the same?
Search Box
• No statistically significant difference
• Insight: Stop debating, it’s easier to get the data
Hard to Assess the Value of Ideas: Data Trumps Intuition
• At Amazon, half of the experiments failed to show improvement
• QualPro tested 150,000 ideas over 22 years
  – 75 percent of important business decisions and business improvement ideas either have no impact on performance or actually hurt performance…
• Based on experiments with ExP at Microsoft
  – 1/3 of ideas were positive ideas and statistically significant
  – 1/3 of ideas were flat: no statistically significant difference
  – 1/3 of ideas were negative and statistically significant
• Our intuition is poor: 2/3 of ideas do not improve the metric(s) they were designed to improve. Humbling!
The HiPPO
• Our opinions are often wrong – get the data
• HiPPO stands for the Highest Paid Person’s Opinion
• Hippos kill more humans than any other (non-human) mammal (really)
• Don’t let HiPPOs in your org kill innovative ideas. ExPeriment!
• We give out these toy HiPPOs at Microsoft
The less data, the stronger the opinions
Is Software Just Hard? NO!
• Doctors have been taking the HiPPocratic Oath and promising “no harm,” yet many beliefs were wrong for hundreds of years
• For centuries, an illness was thought to be a toxin
• Opening a vein and letting the sickness run out was the best solution – bloodletting
• One British medical text recommended bloodletting for acne, asthma, cancer, cholera, coma, convulsions, diabetes, epilepsy, gangrene, gout, herpes, indigestion, insanity, jaundice, leprosy, ophthalmia, plague, pneumonia, scurvy, smallpox, stroke, tetanus, tuberculosis, and for some one hundred other diseases
• Physicians often reported the simultaneous use of fifty or more leeches on a given patient. Through the 1830s the French imported about forty million leeches a year for medical purposes
Bloodletting (2 of 2)
• President George Washington had a sore throat and doctors extracted 82 ounces of blood over 10 hours (35% of his total blood), causing anemia and hypotension. He died that night
• Pierre Louis did an experiment in 1836 that is now recognized as one of the first clinical trials, or randomized controlled experiments. He treated people with pneumonia either with
  – early, aggressive bloodletting, or
  – less aggressive measures
• At the end of the experiment, Dr. Louis counted the bodies. They were stacked higher over by the bloodletting sink
Lancet
Pitfall 1: Wrong Success Metric
Remember this example?
OEC: Clicks on revenue generating links (red below)
[Screenshots: design A and design B]
Pitfall 1: Wrong OEC
• B had a drop in the OEC of 64%
• Were sales correspondingly lower as well?
• No. The experiment is valid if the conversion from a click to purchase is similar
• The price was shown only in B, sending more qualified purchasers to the pipeline
• Lesson: measure what you really need to measure, even if it’s difficult!
Pitfall 2: Incorrect Interval Calculation
• Confidence Intervals (CI) are a great way to summarize results that have variability
• Example: 95% CI for conversion rate might be 2.8%-3.2% (mean of 3.0% +/- 0.2%), which improved from 1.8%-2.2%
• Business users prefer percent effect: 2% to 3% is a 50% improvement in conversion!
• How can we provide a confidence interval on the 50%?
Pitfall 2: Incorrect Interval Calculation (cont)
• You can’t just convert the confidence interval to a percent effect because the denominator is a random variable (we have a ratio of means)
• Use Fieller’s formula for an exact percent effect
  – More complex formula, but that’s why we have computers (and statisticians who figured this out in 1954)
  – Note: the confidence interval is not always symmetric around the mean in this case
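As a rough illustration, here is a sketch of Fieller's interval for the percent effect, applied to the example from the previous slide (Control 2.0% +/- 0.2%, Treatment 3.0% +/- 0.2%). It assumes the two sample means are independent and approximately normal, and it solves Fieller's quadratic directly. The function name is hypothetical, and the platform's actual implementation may differ.

```python
import math
from statistics import NormalDist


def fieller_percent_effect_ci(mean_t, se_t, mean_c, se_c, confidence=0.95):
    """CI for the percent effect 100 * (mean_t / mean_c - 1).

    Assumption: the Treatment and Control means are independent
    (zero covariance) and approximately normally distributed.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # The bounds are the two roots rho of
    #   (mean_t - rho * mean_c)^2 = z^2 * (se_t^2 + rho^2 * se_c^2)
    denom = mean_c**2 - (z * se_c) ** 2
    if denom <= 0:
        raise ValueError("Control mean is too close to zero; the interval is unbounded")
    disc = z**2 * (mean_c**2 * se_t**2 + mean_t**2 * se_c**2 - (z * se_t * se_c) ** 2)
    half = math.sqrt(disc)
    lo = (mean_t * mean_c - half) / denom
    hi = (mean_t * mean_c + half) / denom
    return 100 * (lo - 1), 100 * (hi - 1)


# A +/- 0.2% half-width at 95% confidence corresponds to a standard error of 0.002 / 1.96
se = 0.002 / 1.96
lo, hi = fieller_percent_effect_ci(mean_t=0.03, se_t=se, mean_c=0.02, se_c=se)
print(f"Percent effect: 50%, 95% CI: {lo:.1f}% to {hi:.1f}%")
```

Under these assumptions the interval extends further above the 50% point estimate than below it, which is the asymmetry noted on the slide.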
Pitfall 3: Using Standard formulas for Standard Deviation
• Many metrics for online experiments cannot use the standard statistical formulas
• Example: Click-through rate = clicks/page-views
• The standard statistical approach would assume this would be approximately Bernoulli
• However, the true standard deviation is commonly larger than Bernoulli because of independence violations
• Solution: Bootstrap or the delta method
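The gap between the two approaches can be illustrated with a small sketch of the delta method for a ratio of per-user means. The data is synthetic and every number in it is an assumption. Each simulated user has their own click propensity, so page views from the same user are correlated, which is the kind of independence violation the slide refers to.

```python
import math
import random
from statistics import mean, variance

random.seed(0)

# Synthetic data (assumption): 2,000 users, each with 1-50 page views
# and a user-specific click propensity averaging about 10%.
users = []
for _ in range(2000):
    page_views = random.randint(1, 50)
    propensity = random.betavariate(1, 9)
    n_clicks = sum(random.random() < propensity for _ in range(page_views))
    users.append((n_clicks, page_views))

clicks = [c for c, _ in users]
views = [v for _, v in users]
n = len(users)
ctr = sum(clicks) / sum(views)

# Naive formula: treat every page view as an independent Bernoulli trial
naive_se = math.sqrt(ctr * (1 - ctr) / sum(views))

# Delta method: CTR is a ratio of two per-user means (clicks and page views)
mean_c, mean_v = mean(clicks), mean(views)
var_c, var_v = variance(clicks), variance(views)
cov_cv = sum((c - mean_c) * (v - mean_v) for c, v in users) / (n - 1)
delta_var = (var_c - 2 * ctr * cov_cv + ctr**2 * var_v) / (n * mean_v**2)
delta_se = math.sqrt(delta_var)

print(f"CTR = {ctr:.4f}, naive SE = {naive_se:.5f}, delta-method SE = {delta_se:.5f}")
```

With correlated page views the delta-method standard error comes out larger than the naive one, so intervals built from the naive formula would be too narrow.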
Best Practice: Ramp-up
• Ramp-up
  – Start an experiment at 0.1%
  – Do simple analyses to make sure no egregious problems can be detected
  – Ramp-up to a larger percentage, and repeat until desired percent (e.g., 50%)
• Big differences are easy to detect because the min sample size is quadratic in the effect we want to detect
  – Detecting 10% difference requires a small sample and serious problems can be detected during ramp-up
  – Detecting 0.1% requires a population 100^2 = 10,000 times bigger
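The 10,000x figure can be checked with a short sketch. It assumes the common rule of thumb n = 16 * sigma^2 / delta^2 users per variant (roughly 95% confidence and 80% power), where delta is the absolute difference to detect. The 5% baseline conversion rate is a hypothetical value chosen only for illustration.

```python
def users_per_variant(sigma_squared, delta):
    """Rule-of-thumb sample size per variant: n = 16 * sigma^2 / delta^2."""
    return 16 * sigma_squared / delta**2


baseline = 0.05  # hypothetical baseline conversion rate
sigma_squared = baseline * (1 - baseline)  # variance of a Bernoulli metric

n_large_effect = users_per_variant(sigma_squared, delta=baseline * 0.10)   # 10% relative change
n_small_effect = users_per_variant(sigma_squared, delta=baseline * 0.001)  # 0.1% relative change
ratio = n_small_effect / n_large_effect

print(f"10% change:  {n_large_effect:,.0f} users per variant")
print(f"0.1% change: {n_small_effect:,.0f} users per variant")
print(f"ratio: {ratio:,.0f}x")
```

Because delta appears squared in the denominator, shrinking the detectable effect by a factor of 100 multiplies the required sample by 100^2.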
Pitfall 4: Combining Data when Percent to Treatment Varies
• Simplified example: 1,000,000 users per day
• For each individual day the Treatment is much better
• However, cumulative result for Treatment is worse
• This is called Simpson’s Paradox

Conversion Rate for two days

            Friday                    Saturday                 Total
            (C/T split: 99/1)         (C/T split: 50/50)
Control     20,000/990,000 = 2.02%    5,000/500,000 = 1.00%    25,000/1,490,000 = 1.68%
Treatment   230/10,000 = 2.30%        6,000/500,000 = 1.20%    6,230/510,000 = 1.22%
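The arithmetic in the table can be reproduced with a few lines. The counts are taken from the slide; the variable names are only illustrative.

```python
# (conversions, users) per day, taken from the table above
counts = {
    "Control": {"Friday": (20_000, 990_000), "Saturday": (5_000, 500_000)},
    "Treatment": {"Friday": (230, 10_000), "Saturday": (6_000, 500_000)},
}

daily_rate = {}
pooled_rate = {}
for variant, days in counts.items():
    # Per-day conversion rate
    daily_rate[variant] = {day: conv / users for day, (conv, users) in days.items()}
    # Pooled rate: add up conversions and users across both days
    total_conv = sum(conv for conv, _ in days.values())
    total_users = sum(users for _, users in days.values())
    pooled_rate[variant] = total_conv / total_users

for variant in counts:
    per_day = ", ".join(f"{day} {rate:.2%}" for day, rate in daily_rate[variant].items())
    print(f"{variant}: {per_day}, pooled {pooled_rate[variant]:.2%}")
```

Treatment wins on each day but loses in the pooled total, because most of its users come from Saturday, the day with the lower conversion rate for both variants.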
Pitfall 5: Not Filtering out Robots
• Internet sites can get a significant amount of robot traffic (search engine crawlers, email harvesters, botnets, etc.)
• Robots can cause misleading results
  – Most concerned about robots with high traffic (e.g. clicks or PVs) that stay in Treatment or Control
  – We’ve seen one robot with > 600,000 clicks in a month on one page (and it was executing JavaScript)
Pitfall 5: Not Filtering out Robots (cont)
• Identifying robots can be difficult
  – Some robots identify themselves through the UserAgent
  – Many look like human users and execute JavaScript
  – Use heuristics to ID and remove robots from analysis (e.g. more than 100 clicks in an hour)
  – Ongoing research. No silver bullet
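The "more than 100 clicks in an hour" heuristic might look like the sketch below. It assumes click events arrive as (user_id, timestamp) pairs and uses fixed clock-hour buckets rather than a sliding window. The function name, the event layout, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_suspected_robots(events, max_clicks_per_hour=100):
    """Return user IDs with more than max_clicks_per_hour clicks in any clock hour."""
    clicks_per_user_hour = Counter(
        (user_id, ts.replace(minute=0, second=0, microsecond=0))
        for user_id, ts in events
    )
    return {
        user_id
        for (user_id, _hour), count in clicks_per_user_hour.items()
        if count > max_clicks_per_hour
    }


# Hypothetical click log: one human with 5 clicks, one "user" clicking every 10 seconds
start = datetime(2009, 11, 18, 9, 0)
events = [("human-1", start + timedelta(minutes=7 * i)) for i in range(5)]
events += [("bot-1", start + timedelta(seconds=10 * i)) for i in range(150)]

robots = find_suspected_robots(events)
clean_events = [(user_id, ts) for user_id, ts in events if user_id not in robots]
print(f"suspected robots: {robots}, events kept: {len(clean_events)}")
```

A fixed threshold like this only catches high-volume robots; as the slide says, there is no silver bullet.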
Effect of Robots on A/A Experiment
• Each hour represents clicks from thousands of users
• The “spikes” can be traced to single “users” (robots)
Pitfall 6: Invalid or Inadequate Instrumentation
• Validating initial instrumentation
  – Logging audit – compare experimentation observations with recording system of record
  – A/A experiment – run a “mock” experiment where users are randomly assigned to two groups but users get Control in both
    • Expect about 5% of metrics to be statistically significant
    • P-values should be uniformly distributed on the interval (0,1) and no p-values should be very close to zero (e.g. <0.001)
  – Many of our “customers” initially fail one of these tests
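The "about 5%" expectation can be illustrated by simulating many A/A comparisons. This is a sketch under simple assumptions: both groups draw conversions from the same 10% rate, and each comparison uses a two-proportion z-test. The function name, group size, and rate are hypothetical.

```python
import math
import random
from statistics import NormalDist

random.seed(1)


def aa_p_value(n_users=1000, rate=0.10):
    """p-value of a z-test comparing two groups that both receive Control."""
    conv_a = sum(random.random() < rate for _ in range(n_users))
    conv_b = sum(random.random() < rate for _ in range(n_users))
    pooled = (conv_a + conv_b) / (2 * n_users)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_users)
    z = (conv_a - conv_b) / n_users / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_values = [aa_p_value() for _ in range(1000)]
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

# Count p-values per decile; a roughly flat histogram suggests a uniform distribution
deciles = [0] * 10
for p in p_values:
    deciles[min(int(p * 10), 9)] += 1

print(f"share with p < 0.05: {share_significant:.1%}")
print("p-values per decile:", deciles)
```

In a real A/A test, a share far from 5% or a clearly skewed histogram would point to a problem in assignment, logging, or the variance calculation.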
Pitfall 7: Insufficient Experimental Control
• Must make sure the only difference between Treatment and Control is the change being tested
• Plot shows hourly click-through rate for Control and Treatment in the MSN Home Page
• Headlines were supposed to be the same in both
• One headline was different for one 7-hour period, significantly changing the result
Summary
1. It is hard to assess the value of ideas
  – Get the data by experimenting because data trumps intuition
  – Examples are humbling
  – Avinash Kaushik wrote: “…the power of: Controlled Experiments. I am convinced this is God’s gift to online humanity.”
2. Replace the HiPPO with an OEC
  – Make sure the org agrees what you are optimizing (long term lifetime value)
  – Experts are often wrong. Doctors did bloodletting for centuries (and they swear by the HiPPOcratic oath)
3. Watch out for the pitfalls
Resources for Deeper Dive
• Controlled Experiments on the Web: Survey and Practical Guide, in the Data Mining and Knowledge Discovery journal, 2009: http://exp-platform.com/hippo_long.aspx
• KDD 2009 Tutorial: http://exp-platform.com/tutorial.aspx
• Contact: ronnyk@ microsoft dot you know what
Live Q&A with Anne, Ronny, Roger
WhichTestWon.com
Thanks, plus 2 free offers:
Online Testing Awards
• Free entries
• Everyone eligible
• Deadline this Friday!
http://whichtestwon.com/awards
Free Landing Page Evaluation Offer
Click to schedule: http://whichtestwon.com/widerfunnel/lp.html