A/B Testing Pitfalls and Lessons Learned at Spotify



Danielle Jabin

A/B Testing: Avoiding Common Pitfalls

Who are you?


Babies named Ava caused the housing bubble

Correlation does not equal causation!

What about in a product context?

Do your improvements actually affect user behavior, or are the variations due to chance?

A/B Testing!

But…what’s an A/B test?


Pitfall #1: Not limiting your error rate


What if I flip a coin 100 times, and I get 51 heads?


What if I flip a coin 100 times, and I get 5 heads?


Statistical significance: probability that a change is not due to chance alone

●Alpha levels of 5% and 1% are the most common
– Alternatively: P(significant) = .05 or .01
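To make the coin-flip question concrete, here's a minimal sketch (assuming Python with scipy; not from the deck itself) that tests each outcome against a fair coin at alpha = .05:

```python
from scipy.stats import binomtest

# Two-sided test of the observed number of heads against a fair coin (p = 0.5).
for heads in (51, 5):
    p = binomtest(heads, n=100, p=0.5).pvalue
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{heads} heads out of 100: p = {p:.3g} -> {verdict}")
```

51 heads is entirely consistent with a fair coin; 5 heads is far beyond any common alpha level.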

Remedies

●To lock in error rates before you start a test, fix your sample size

What should my sample size be?

Sample size in each group, assuming equal-sized groups:

n = 2σ²(Z_β + Z_{α/2})² / (difference)²

●Z_β represents the desired power (typically .84 for 80% power)
●Z_{α/2} represents the desired level of statistical significance (typically 1.96)
●σ = standard deviation of the outcome variable
●difference = effect size (the difference in means)

Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt


Let's see some numbers

●With σ = 10, difference in means = 1

                           One-sided test   Two-sided test
alpha = .10, power = .80        841              1230
alpha = .05, power = .80       1230              1568
alpha = .01, power = .80       2010              2339
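A minimal sketch of that formula in code, assuming Python with scipy (the deck itself shows no code):

```python
from scipy.stats import norm

def sample_size_per_group(alpha, power, sigma, difference, two_sided=True):
    """n per group for comparing two means, assuming equal-sized groups."""
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_power = norm.ppf(power)          # ~.84 for 80% power
    return 2 * sigma**2 * (z_alpha + z_power)**2 / difference**2

# Two-sided, alpha = .05, power = .80, sigma = 10, difference in means = 1:
# prints ~1570 (1568 in the table above, which uses the rounded 1.96 and .84).
print(round(sample_size_per_group(0.05, 0.80, sigma=10, difference=1)))
```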

Pitfall #2: Stopping your test before the fixed sample size stopping point


Let's see some more numbers

●1,000 experiments with 200,000 fake participants divided randomly into two groups, both receiving the exact same version, A, with a 3% conversion rate

                            Stop at first point of significance   Ended as significant
90% significance reached              654 of 1,000                    100 of 1,000
95% significance reached              427 of 1,000                     49 of 1,000
99% significance reached              146 of 1,000                     14 of 1,000
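A minimal sketch of that kind of simulation, assuming Python with numpy/scipy and scaled-down hypothetical parameters (the deck's own code isn't shown). Both groups get the same 3% conversion rate, so every significant result is a false positive:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_experiments = 1_000
n_per_group = 10_000                       # scaled down from the deck's 100,000 per group
checkpoints = np.arange(500, n_per_group + 1, 500)   # "peek" every 500 users
z_crit = norm.ppf(1 - 0.05 / 2)            # 95% significance, two-sided

stopped_early = ended_significant = 0
for _ in range(n_experiments):
    a = rng.random(n_per_group) < 0.03     # group A conversions (3% rate)
    b = rng.random(n_per_group) < 0.03     # group B: the exact same version
    ca, cb = np.cumsum(a), np.cumsum(b)
    n = checkpoints
    p_a, p_b = ca[n - 1] / n, cb[n - 1] / n
    pooled = (ca[n - 1] + cb[n - 1]) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    z = np.abs(p_a - p_b) / np.where(se > 0, se, np.inf)   # two-proportion z-test
    stopped_early += bool((z > z_crit).any())  # stopping at the first "significant" peek
    ended_significant += bool(z[-1] > z_crit)  # waiting for the fixed stopping point

print(f"Stopped at first point of significance: {stopped_early} of {n_experiments}")
print(f"Ended as significant:                   {ended_significant} of {n_experiments}")
```

The first count lands far above the nominal 5%, while the second stays close to it.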

Remedies

●Don't peek
●Okay, maybe you can peek, but don't stop or make a decision before you reach the fixed sample size stopping point
●Sequential sampling

Pitfall #3: Making multiple comparisons in one test


Let’s look at an A/B test from our Discovery feature:


A test can be one of two things: significant or not significant

●P(significant) + P(not significant) = 1
●Let's take an alpha of .05
– P(significant) = .05
– P(not significant) = 1 - P(significant) = 1 - .05 = .95

What about for two comparisons?

●P(at least 1 significant) = 1 - P(none of the 2 are significant)
●P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025
●P(at least 1 significant) = 1 - .9025 = .0975
●That's almost 2x (1.95x, to be precise) your .05 significance rate!

And it just gets worse…

                  P(at least 1 significant)   An increase of…
5 comparisons     1 – (1-.05)^5 = .23         4.6x
10 comparisons    1 – (1-.05)^10 = .40        8x
20 comparisons    1 – (1-.05)^20 = .64        12.8x
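Each row of the table is one line of arithmetic; a quick Python sketch to reproduce it:

```python
# Family-wise error rate: the chance of at least one false positive
# across k independent comparisons, each tested at alpha = .05.
alpha = 0.05
for k in (2, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:>2} comparisons: P(at least 1 significant) = {fwer:.4f}")
```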


How can we remedy this?

●Bonferroni correction
– Divide P(significant), your alpha, by the number of alternate hypotheses you have, n
– alpha/n becomes the new level of statistical significance

So what about two comparisons now?

●Our new P(significant) = .05/2 = .025
●Our new P(not significant) = 1 - .025 = .975
●P(at least 1 significant) = 1 - P(none of the 2 are significant)
●P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .9506
●P(at least 1 significant) = 1 - .9506 = .0494

P(significant) stays under .05

                  Corrected alpha    P(at least 1 significant)
5 comparisons     .05/5 = .01        1 – (1-.01)^5 = .049
10 comparisons    .05/10 = .005      1 – (1-.005)^10 = .049
20 comparisons    .05/20 = .0025     1 – (1-.0025)^20 = .049
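The same sketch with the Bonferroni-corrected alpha confirms the family-wise rate stays pinned near .05:

```python
# Bonferroni correction: test each of k comparisons at alpha / k.
alpha = 0.05
for k in (2, 5, 10, 20):
    corrected = alpha / k
    fwer = 1 - (1 - corrected) ** k
    print(f"{k:>2} comparisons at alpha = {corrected:.4f}: "
          f"P(at least 1 significant) = {fwer:.4f}")
```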

Questions?