Statistics in 40 Minutes: A/B Testing Fundamentals
Transcript of Statistics in 40 Minutes: A/B Testing Fundamentals
Statistics in 40 Minutes: A/B Testing Fundamentals
Leo Pekelis, Statistician, Optimizely
#opticon2015
You have your own unique approach to A/B Testing
The goal of this talk is to break down A/B Testing to its fundamentals.
A/B Testing Platform:
1) Create an experiment
2) Read the results page
“A/B Testing Playbook”
• Opening: Hypotheses
• Mid-game: Outcomes & Error Rates
• Mid-game: Fundamental Tradeoff
• Closing: Confidence Intervals
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
The answers:
1. A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many low signal goals.
3. All three levers are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.
First, some vocabulary (yay!)
• Control and Variation: A control is the original, or baseline, version of content that you are testing against a variation.
• Goal: The metric used to measure the impact of the control and variation.
• Baseline conversion rate: The control group’s expected conversion rate.
• Effect size: The improvement (positive or negative) of your variation over the baseline.
• Sample size: The number of visitors in your test.
• A hypothesis test is a control and a variation that you want to show improves a goal.
• An experiment is a collection of hypotheses (goal & variation pairs) that all share the same control.
“A/B Testing Playbook”
• Opening: Hypotheses
• Mid-game: Outcomes & Error Rates
• Mid-game: Fundamental Tradeoff
• Closing: Confidence Intervals
What is a good hypothesis (test)?
Bad hypothesis: “I think changing the header image will make my site better.”
Why is this not actionable? Test creep:
• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase “the finals” clicks?
• Growing the header will increase engagement?
• Growing the header will increase “the finals” clicks?
• …
Good hypotheses, organized and clear:
• Removing the header will increase engagement?
• Removing the header will increase total revenue?
• Removing the header will increase “the finals” clicks?
• Growing the header will increase engagement?
• Growing the header will increase “the finals” clicks?
• …
Hypotheses also determine the cost of your experiment: the more relationships (hypotheses) you test, the longer (more visitors) it will take to achieve the same outcome (error rate).
Questions to check for a good hypothesis:
• What are you trying to show with your idea?
• What key metrics should it drive?
• Are all your goals and variations necessary given your testing limits?
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
Answer: A good hypothesis has a variation and a clearly defined goal which you crafted ahead of time.
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
“A/B Testing Playbook”
• Opening: Hypotheses
• Mid-game: Outcomes & Error Rates
• Mid-game: Fundamental Tradeoff
• Closing: Confidence Intervals
What are the possible outcomes?
| Result of test | “True” value: Improvement | “True” value: No effect |
| --- | --- | --- |
| Winner / Loser | True positive | False positive |
| Inconclusive | False negative | True negative |

The four possible outcomes:
• (no effect, winner / loser): a false positive
• (+/- improvement, inconclusive): a false negative
• (+/- improvement, winner / loser): a true positive
• (no effect, inconclusive): a true negative
The 2x2 table will help us to:
1. Keep track of the different error rates we care about
2. Explore the consequences of controlling false positives vs false discoveries
Error rate 1: False positive rate
• False positive rate (Type I error) = “Chance of a false positive from a variation with no effect on a goal”
• Thresholding the FPR: “When I have a variation with no effect on a goal, I’ll find an effect less than 10% of the time.”
How can we ever compute a False Positive Rate if we don’t know whether a hypothesis is true or not?
Statistical tests (fixed horizon t-test, Stats Engine) are designed to threshold an error rate.
Example: “Calling winners & losers when a p-value is below .05 will guarantee a False Positive Rate below 5%.”
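That guarantee can be checked empirically. Below is a minimal sketch, assuming an ordinary fixed-horizon two-proportion z-test run on simulated A/A experiments with made-up traffic numbers (this is a textbook test, not Stats Engine): when there is truly no effect, roughly 5% of tests call a winner or loser at p < .05.

```python
import math
import random

random.seed(1)

def z_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

# A/A experiments: both arms share the same 10% true conversion rate,
# so every winner/loser called at p < .05 is a false positive.
n, true_rate, trials = 1000, 0.10, 1000
false_positives = sum(
    z_test_pvalue(sum(random.random() < true_rate for _ in range(n)), n,
                  sum(random.random() < true_rate for _ in range(n)), n) < 0.05
    for _ in range(trials)
)
print(false_positives / trials)  # hovers near the 5% threshold
```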
![Page 37: Statistics in 40 Minutes: A/B Testing Fundamentals](https://reader030.fdocuments.in/reader030/viewer/2022032620/55cacca7bb61ebe80d8b4720/html5/thumbnails/37.jpg)
False Positive Rates with multiple tests
https://xkcd.com/882/
What happened?
21 tests × 5% FPR ≈ 1 false positive on average
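The arithmetic behind the jelly-bean comic takes a few lines (treating the comic's 21 tests as independent is an assumption):

```python
# 21 independent tests, each thresholded at a 5% false positive rate.
m, alpha = 21, 0.05

expected_false_positives = m * alpha     # 1.05, i.e. about 1 on average
p_at_least_one = 1 - (1 - alpha) ** m    # chance of at least one false positive

print(expected_false_positives)   # 1.05
print(round(p_at_least_one, 3))   # 0.659
```

So even though each test is individually well behaved, the experiment as a whole produces a spurious "green jelly beans cause acne" result about two times out of three.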
False positive rates are only useful in the context of all hypotheses.
Error rate 2: False discovery rate
• False discovery rate (FDR) = “Chance of a false positive from a conclusive result”
• Thresholding the FDR: “When you see a winning or losing goal on a variation, it’s wrong less than 10% of the time.”
1 conclusive result × 5% FDR = 0.05 false positives on average
False discovery rates remain accurate regardless of the number of hypotheses.
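One way a procedure can keep the false discovery rate controlled no matter how many hypotheses you add is the classic Benjamini-Hochberg step-up rule, sketched below. This is a textbook illustration with made-up p-values, not Optimizely's Stats Engine.

```python
def benjamini_hochberg(pvalues, q=0.10):
    """Return the set of indices declared conclusive while controlling FDR at q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    # Find the largest rank whose p-value clears its rank-scaled threshold.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    return set(order[:k])

# Ten hypotheses: two strong effects among mostly nulls (made-up p-values).
pvals = [0.001, 0.004, 0.03, 0.20, 0.35, 0.41, 0.55, 0.62, 0.78, 0.91]
print(sorted(benjamini_hochberg(pvals)))  # [0, 1, 2]
```

Note the cutoff each p-value is compared to grows with its rank (q·rank/m): that adaptivity is what keeps the rate of false positives among conclusive results at q regardless of how many hypotheses you test.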
What’s the catch?
The more hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.
Not quite …
[Figure: low signal vs high signal hypotheses]
What’s the catch?
The more low signal hypotheses (goals & variations) in your experiment, the longer it takes to find conclusive results.
Recap
• False Positive Rate thresholding
- controls the chance of a false positive when you have a hypothesis with no effect
- misrepresents your error rate with multiple goals and variations
• False Discovery Rate thresholding
- controls the chance of a false positive when you have a winning or losing hypothesis
- is accurate regardless of how many hypotheses you run
- can take longer to reach significance with more low signal variations on goals
Tips & Tricks for running experiments with False Discovery Rates
• Ask: which goal is most important to me? Make that your primary goal (it is not impacted by all the other goals).
• Run large, or large multivariate, tests without fear of finding spurious results, but be prepared for the cost of exploration.
• A little human intuition and prior knowledge can go a long way towards reducing the runtime of your experiments.
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
Answer: False Positive Rates have hidden sources of error when testing many goals and variations. False Discovery Rates correct this by increasing runtime when you have many noisy goals.
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
“A/B Testing Playbook”
• Opening: Hypotheses
• Mid-game: Outcomes & Error Rates
• Mid-game: Fundamental Tradeoff
• Closing: Confidence Intervals
“3 Levers” of A/B Testing
1. Threshold an error rate
• “I want no more than 10% false discovery rate”
2. Detect effect sizes (set an MDE)
• “I’m OK with only detecting greater than 5% improvement”
3. Run tests longer
• “I can afford to run this test for 3 weeks, or 50,000 visitors”
Fundamental Tradeoff of A/B Testing
[Diagram: error rates, runtime, and effect size / baseline CR, all inversely related]
• At any number of visitors, the less you threshold your error rate, the smaller the effect sizes you can detect.
• At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.
• For any effect size, the lower the error rate you want, the longer you need to run your test.
What does this look like in practice?
Average visitors needed to reach significance with Stats Engine (baseline conversion rate = 10%):

| Significance threshold | 5% improvement | 10% improvement | 25% improvement |
| --- | --- | --- | --- |
| 95% | 62,400 | 13,500 | 1,800 |
| 90% | 59,100 | 12,800 | 1,700 |
| 80% | 52,600 | 11,400 | 1,500 |
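The table's qualitative pattern can be reproduced with the classical fixed-horizon sample size formula for comparing two conversion rates. This is a planning sketch only: the exact visitor counts above come from Stats Engine's sequential test and will not match this formula.

```python
import math

def visitors_per_arm(baseline, rel_improvement, z_alpha=1.960, z_power=0.842):
    """Fixed-horizon sample size per arm for a two-sided z-test
    (defaults: 5% significance level, 80% power)."""
    p1 = baseline
    p2 = baseline * (1 + rel_improvement)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Smaller improvements on a 10% baseline need disproportionately more visitors.
for mde in (0.05, 0.10, 0.25):
    print(f"{mde:.0%} improvement: {visitors_per_arm(0.10, mde):,} per arm")
```

Halving the detectable improvement roughly quadruples the required sample, because the effect size enters the denominator squared.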
All A/B Testing platforms address the fundamental tradeoff …
1. Choose a minimum detectable effect (MDE) and false positive rate threshold
2. Find the required minimum sample size with a sample size calculator
3. Wait until the minimum sample size is reached
4. Look at your results once and only once
Optimizely is the only platform that lets you pull the levers in
real time
In the beginning, we make an educated guess …
[Levers: error rate 5%; effect size / baseline CR +5%, 10%; runtime 52,600 visitors]
… but then the improvement turns out to be better …
[Levers: error rate 5%; effect size / baseline CR +13%, 16%; remaining runtime 1,600 visitors, instead of 52,600 − 7,200 = 45,400]
… or a lot worse.
[Levers: error rate 5%; effect size / baseline CR +2%, 8%; remaining runtime > 100,000 visitors]
Recap
• The Fundamental Tradeoff of A/B Testing affects you no matter what testing platform you use.
- If you want to detect a 5% improvement on a 10% baseline conversion rate, you should be prepared to wait for at least 50,000 visitors.
• Optimizely’s Stats Engine is the only platform that allows you to adjust the tradeoff in real time while still reporting valid error rates.
At the end of this talk, you should be able to answer:
1. What makes a good hypothesis?
2. What are the differences between False Positive Rate and False Discovery Rate?
3. How can I trade off pulling the 3 levers I have: thresholding error rates, detecting smaller improvements, and running my tests longer?
Answer: All three are inversely related. For example, running my tests longer can get me lower error rates, or detect smaller effects.
“A/B Testing Playbook”
• Opening: Hypotheses
• Mid-game: Outcomes & Error Rates
• Mid-game: Fundamental Tradeoff
• Closing: Confidence Intervals
Definition:
A confidence interval is a range of values for your metric (revenue, conversion rate, etc.) that is 90%* likely to contain the true difference between your variation and baseline.
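A minimal sketch of how such an interval is computed for a conversion-rate difference, using the normal approximation (the visitor counts below are made up, and Optimizely's actual intervals come from Stats Engine and adjust as data accrues):

```python
import math

def confidence_interval(conv_a, n_a, conv_b, n_b, z=1.645):
    """90% normal-approximation CI for variation minus baseline
    conversion rate (z = 1.645 gives 90% coverage)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical data: 10,000 visitors per arm, 10.0% vs 11.2% conversion.
lo, hi = confidence_interval(1000, 10000, 1120, 10000)
print(round(lo, 4), round(hi, 4))  # about (0.0048, 0.0192)
```

The interval's endpoints are the worst case and best case for the true difference; its center is the middle-ground estimate.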
[Example confidence interval: worst case 7.29, middle ground 11.4, best case 15.41]
This is true regardless of your significance.
We can’t wait for significance
The confidence interval tells us what we need to know
A confidence interval is the mirror image of statistical significance
Mathematical Definition:
The set of parameter values X so that a hypothesis test with null hypothesis
H0: Removing a distracting header will result in X more revenue per visitor.
is not yet rejected.
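That mathematical definition can be made concrete: scan candidate values of X and keep those the test does not reject; the endpoints of the kept set land on the 90% confidence interval. The difference and standard error below are illustrative numbers, not figures from the talk.

```python
import math

def not_rejected(x, diff, se, alpha=0.10):
    """Is the null 'true difference equals x' kept by a two-sided z-test?"""
    z = abs(diff - x) / se
    p = math.erfc(z / math.sqrt(2))  # two-sided p-value
    return p >= alpha

# Scan candidate X values in steps of 0.0001; the kept set is the 90% CI.
diff, se = 0.012, 0.00435
kept = [round(i / 10000, 4)
        for i in range(-50, 400)
        if not_rejected(i / 10000, diff, se)]
print(kept[0], kept[-1])  # 0.0049 0.0191, i.e. diff +/- 1.645 * se
```

This "mirror image" is why a conclusive (significant) result is exactly one whose confidence interval excludes zero.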
Error rate 3: False negative rate
| Outcome of test | Improvement | No effect |
| --- | --- | --- |
| Winner / Loser | True positive | False positive |
| Inconclusive | False negative | True negative |

• False negative rate (Type II error) = “Rate of false negatives from all variations with an improvement on a goal”
= #(False negatives) / #(Improvements)
• Thresholding Type II error: “When you have a goal on a variation with an effect, you miss it less than 10% of the time.”
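The false negative rate can be estimated the same way as the false positive rate, by simulating experiments where a real improvement exists. Again a fixed-horizon z-test sketch with made-up numbers, not Stats Engine:

```python
import math
import random

random.seed(7)

def z_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return math.erfc(abs(p_b - p_a) / se / math.sqrt(2)) if se else 1.0

# A real improvement exists (10% -> 12.5%), so every inconclusive
# result at p >= .05 is a false negative.
n, trials = 2000, 1000
misses = sum(
    z_test_pvalue(sum(random.random() < 0.100 for _ in range(n)), n,
                  sum(random.random() < 0.125 for _ in range(n)), n) >= 0.05
    for _ in range(trials)
)
print(misses / trials)  # the false negative rate at this sample size
```

Pulling any of the three levers moves this number: more visitors per arm, a larger true effect, or a looser significance threshold all lower the miss rate.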