Talks@Coursera - A/B Testing @ Internet Scale
Transcript of Talks@Coursera - A/B Testing @ Internet Scale
A/B Testing @ Internet Scale
Ya Xu 8/12/2014 @ Coursera
A/B Testing in One Slide
[Diagram: users are randomly split (e.g. 20% / 80%) between the Control and Treatment versions of a page with a “Join now” button; collect results to determine which one is better]
Outline
§ Culture challenge
  – Why A/B testing
  – What to A/B test
§ Building a scalable experimentation system
§ Best practices
Why A/B Testing
Amazon Shopping Cart Recommendation
• At Amazon, Greg Linden had the idea of showing recommendations based on items in the shopping cart
• Trade-offs
  • Pro: cross-sell more items (increase average basket size)
  • Con: distract people from checking out (reduce conversion)
• HiPPO (Highest Paid Person’s Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
MSN Real Estate
§ “Find a house” widget variations
§ Revenue to MSN generated every time a user clicks the search/find button
[Screenshots: widget variants A and B]
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
Take-away
Experiments are the only way to prove causality.
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
What to A/B Test
Ads CTR Drop
[Chart: profile top ads CTR, sudden drop on 11/11/2013]
Root-Cause
5 Pixels!!
[Screenshot: the navigation bar overlapping the profile top ads by 5 pixels]
What to A/B Test
§ Evaluating new ideas:
  – Visual changes
  – Complete redesign of a web page
  – Relevance algorithms
  – …
§ Platform changes
§ Code refactoring
§ Bug fixes
Test Everything!
Startups vs. Big Websites
§ Do startups have enough users to A/B test?
  – Startups typically look for larger effects
  – Detecting a 0.5% difference instead of a 5% difference requires roughly 100 times more users (see the sketch below)
§ Startups should establish an A/B testing culture early
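A back-of-the-envelope sketch of where the factor of roughly 100 comes from: for a two-proportion test, required sample size grows with 1/δ², so a 10x smaller detectable difference needs about 100x more users. The baseline rate, significance level, and power below are illustrative assumptions, not numbers from the talk.

```python
# Rough sample-size sketch: users per variant for a two-proportion z-test.
# Baseline rate, alpha, and power are made-up assumptions for illustration.
from scipy.stats import norm

def users_per_variant(p_base, rel_lift, alpha=0.05, power=0.8):
    delta = p_base * rel_lift                      # absolute difference to detect
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # standard normal quantiles
    return 2 * z**2 * p_base * (1 - p_base) / delta**2

big_effect = users_per_variant(0.10, 0.05)     # detect a 5% relative lift
small_effect = users_per_variant(0.10, 0.005)  # detect a 0.5% relative lift
print(f"5% lift:   {big_effect:,.0f} users/variant")
print(f"0.5% lift: {small_effect:,.0f} users/variant (~{small_effect / big_effect:.0f}x more)")
```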
A Scalable Experimentation System
A/B Testing 3 Steps
1. Design – what/whom to experiment on
2. Deploy – code deployment
3. Analyze – impact on metrics
A/B Testing Platform Architecture
1. Experiment Management
2. Online Infrastructure
3. Offline Analysis
Example: Bing A/B
1. Experiment Management
§ Define experiments
  – Whom to target?
  – How to split traffic?
§ Start/stop an experiment
§ Important additions:
  – Define success criteria
  – Power analysis
2. Online Infrastructure
1) Hash & partition: random & consistent
2) Deploy: server-side, as a change to
   – the default configuration (Bing)
   – the default code path (LinkedIn)
3) Data logging
[Diagram: Hash(ID) maps each user into the 0–100% range, which is partitioned into buckets, e.g. Treatment1 20%, Treatment2 20%, remainder Control]
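A minimal sketch of the hash & partition step under the assumptions above (MD5 of an experiment-specific key, 100 buckets, made-up experiment name and allocations). It is not the actual Bing or LinkedIn implementation, but it shows why assignment is random across members yet consistent for any given member.

```python
# Hash & partition sketch: random across members, consistent for a given member.
# Experiment name, salt format, and allocations are illustrative assumptions.
import hashlib

def assign_variant(member_id: str, experiment: str, allocations: dict) -> str:
    """allocations maps variant -> percent of traffic; the remainder is control."""
    digest = hashlib.md5(f"{experiment}:{member_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                 # deterministic bucket in [0, 100)
    low = 0
    for variant, pct in allocations.items():
        if low <= bucket < low + pct:
            return variant
        low += pct
    return "control"

# 20% treatment1, 20% treatment2, 60% control; the same member always gets the same answer.
print(assign_variant("member-12345", "cart-recs", {"treatment1": 20, "treatment2": 20}))
```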
Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
[Diagram: one shared 0–100% range carved up across experiments, e.g. Exp. 1 20%, Exp. 2 20%, Exp. 3 60%, with Exp. 3’s traffic split into red 15%, green 15%, yellow 30%]
• Does not scale
• Traffic management
Hash & Partition @ Scale (II)
§ Fully overlapping system
[Diagram: Exp. 1 and Exp. 2 each span the full 0–100% range, each with its own control/A/B split]
• Each experiment gets 100% of traffic
• A user is in “all” experiments simultaneously
• Randomization between experiments is independent (unique hashID); see the sketch below
• Cannot avoid interaction
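A quick illustrative check of the “unique hashID” point: when each experiment hashes the same member with its own hashID (used here as a salt; the experiment names and 50/50 split are made up), being in treatment in one experiment is approximately independent of being in treatment in another, which is what lets every experiment take 100% of traffic.

```python
# Simulated check that per-experiment hashIDs give independent assignments.
# Experiment names and the 50/50 split are illustrative assumptions.
import hashlib

def in_treatment(member_id: str, hash_id: str, pct: int = 50) -> bool:
    h = int(hashlib.md5(f"{hash_id}:{member_id}".encode()).hexdigest(), 16)
    return h % 100 < pct

members = [f"member-{i}" for i in range(100_000)]
both = sum(in_treatment(m, "exp1-search") and in_treatment(m, "exp2-ads") for m in members)
print(f"P(in both treatments) ~ {both / len(members):.3f}  (independence predicts ~0.25)")
```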
Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
• Centralized management (Bing)
  • Central experimentation team creates/manages layers/domains
• De-centralized management (LinkedIn)
  • Each experiment is one “layer” by default
  • Experimenter controls the hashID to create a “domain”
Data Logging
§ Trigger
§ Trigger-based logging
  – Log whether a request is actually affected by the experiment
  – Log for both factual & counter-factual
[Diagram: all LinkedIn members (300MM+) vs. the triggered population: members visiting the contacts page]
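A hedged sketch of trigger-based logging; the event name, fields, and print-based sink below are assumptions, not the platform’s real schema. The point is that the trigger fires only where the experiment could change behavior, and it fires for control members too (the counter-factual), so the analyzed populations stay comparable.

```python
# Trigger-based logging sketch. Event schema and sink are illustrative only.
import json, time

def log_trigger(member_id, experiment, variant):
    """Emit a trigger event at the exact point the experiment could change behavior."""
    print(json.dumps({            # stand-in for the real event pipeline
        "ts": time.time(),
        "experiment": experiment,
        "member_id": member_id,
        "variant": variant,       # factual assignment for this member
    }))

def render_contacts_page(member_id, variant):
    # Fire for BOTH treatment (factual) and control (counter-factual: the member
    # *would* have been affected), so only truly exposed members enter the analysis.
    log_trigger(member_id, "contacts-redesign", variant)
    return "new contacts page" if variant == "treatment" else "old contacts page"

print(render_contacts_page("member-12345", "control"))
```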
3. Automated Offline Analysis
§ Large-scale data processing, e.g. daily @LinkedIn:
  – 200+ experiments
  – 700+ metrics
  – Billions of experiment trigger events
§ Statistical analysis
  – Metrics design
  – Statistical significance test (p-value, confidence interval)
  – Deep-dive: slicing & dicing capability
§ Monitoring & alerting
  – Data quality
  – Early termination
Best Practices
Example: Unified Search
What to Experiment?
Measure one change at a time.
[Diagram: 50/50 split of En-US traffic between unified search (experiments 1+2+…+N bundled) and pre-unified search]
What to Measure?
§ Success metrics: summarize whether treatment is better
§ Puzzling example:
  – Key metrics for Bing: number of searches & revenue
  – A ranking bug in an experiment resulted in poor search results
  – Number of searches up +10% and revenue up +30%
  – (Users had to search more, and clicked more ads, to find what they wanted: good for the short-term metrics, bad for users)
Success metrics should reflect long-term impact.
Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§ Site speed matters
  – Bing: +100 msec = -0.6% revenue
  – Amazon: +100 msec = -1.0% revenue
  – Google: +100 msec = -0.2% queries
§ But not for Etsy.com? “Faster results better? … meh”
Power
§ Power: the chance of detecting a difference when there really is one.
§ Two reasons your feature doesn’t move metrics:
  1. No “real” impact
  2. Not enough power
Properly power up your experiment!
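An illustrative power calculation (normal approximation for a two-proportion test; the baseline rate, lift, and sample sizes are made-up numbers, not from the talk) showing how the same true effect can be nearly invisible in an under-powered experiment.

```python
# Power of a two-proportion z-test (normal approximation). Numbers are made up.
from scipy.stats import norm

def power(n_per_variant, p_base, rel_lift, alpha=0.05):
    """Chance of detecting the given relative lift at significance level alpha."""
    p_treat = p_base * (1 + rel_lift)
    se = (p_base * (1 - p_base) / n_per_variant
          + p_treat * (1 - p_treat) / n_per_variant) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - abs(p_treat - p_base) / se)

print(f"10k users/variant:  {power(10_000, 0.10, 0.02):.0%} power")   # badly under-powered
print(f"500k users/variant: {power(500_000, 0.10, 0.02):.0%} power")  # properly powered
```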
Statistical Significance
§ Which experiment has a bigger impact?

            Experiment 1               Experiment 2
Pageviews   1.5%                       12.9%
Revenue     0.8% (stat. significant)   2.4%

§ Must consider statistical significance
  – A 12.9% delta can still be noise! (see the simulated example below)
  – Identify signal from noise; focus on the “real” movers
  – Ensure results are reproducible
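A simulated illustration of the table’s point (the data are synthetic, not the two experiments from the slide): a large observed lift from a small, noisy experiment can be statistically indistinguishable from noise, while a much smaller lift measured on a large population can be highly significant.

```python
# Synthetic example: a big observed delta can be noise; a small one can be real.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def lift_and_pvalue(control, treatment):
    lift = treatment.mean() / control.mean() - 1
    _, p = stats.ttest_ind(treatment, control, equal_var=False)
    return lift, p

small_c = rng.exponential(5.0, 200)            # small experiment, noisy metric
small_t = rng.exponential(5.0, 200) * 1.10
big_c = rng.exponential(5.0, 200_000)          # large experiment, tiny true lift
big_t = rng.exponential(5.0, 200_000) * 1.015

for name, (c, t) in {"small & noisy": (small_c, small_t),
                     "large": (big_c, big_t)}.items():
    lift, p = lift_and_pvalue(c, t)
    print(f"{name:14s} observed lift {lift:+.1%}, p = {p:.3g}")
```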
Multiple Testing
§ Famous xkcd comic on Jelly Beans
Multiple Testing Concerns
§ Multiple ramps
  – Pre-decide a ramp to base the decision on (e.g. 50/50)
§ Multiple “peeks”
  – Rely on “full”-week results
§ Multiple variants
  – Choose the best, then rerun to see if it replicates
§ Multiple metrics
An irrelevant metric is statistically significant. What to do?
§ Which metric?
§ How “significant”? (p-value)
[Diagram: tiered p-value thresholds]
  – 1st order metrics (directly impacted by the exp.): p-value < 0.05
  – 2nd order metrics (maybe impacted by the exp.): p-value < 0.01
  – All other metrics: p-value < 0.001
Watch out for multiple testing
With 100 metrics tested at p < 0.05, how many would you expect to see stat. significant even if your experiment does NOTHING? About 5.
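An A/A-style simulation of that arithmetic (100 × 0.05 ≈ 5), using synthetic data: every metric is drawn from identical distributions in “control” and “treatment”, yet roughly five still cross p < 0.05 by chance.

```python
# A/A simulation: with no true effect, ~5% of metrics look significant at 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_metrics, n_users = 100, 10_000

false_positives = 0
for _ in range(n_metrics):
    control = rng.normal(0.0, 1.0, n_users)     # identical distributions:
    treatment = rng.normal(0.0, 1.0, n_users)   # the "experiment" does nothing
    _, p = stats.ttest_ind(treatment, control)
    false_positives += p < 0.05

print(f"{false_positives} of {n_metrics} metrics flagged at p < 0.05")  # expect ~5
```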
References
§ Tang, Diane, et al. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. KDD 2010: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. 2010.
§ Kohavi, Ron, et al. Online Controlled Experiments at Large Scale. KDD 2013: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 2013.
§ LinkedIn blog post: http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional Resources: RecSys’14 A/B testing workshop