A/B Testing at SweetIM

A/B Testing at SweetIM – the Importance of Proper Statistical Analysis
Slava Borodovsky, SweetIM, [email protected]
Saharon Rosset, Tel Aviv University, [email protected]

Transcript of A/B Testing at SweetIM

Page 1: A/B Testing at SweetIM

A/B Testing at SweetIM – the Importance of Proper Statistical Analysis

Slava Borodovsky, SweetIM, [email protected]

Saharon Rosset, Tel Aviv University, [email protected]

Page 2: A/B Testing at SweetIM

About SweetIM

• Provides interactive content and search services for IMs and social networks.

• More than 1,000,000 monthly new users.

• More than 100,000,000 monthly search queries.

• Every new feature and product change passes A/B testing.

• The data-driven decision-making process is based on A/B testing results.

Page 3: A/B Testing at SweetIM

A/B testing – a buzzword?

• Standard A/B flow: an LP (landing page) visit is split 50/50 into Control (A), which sees no change, and Feature (B), which sees the new product.

Previous works:

• KDD 2009, Microsoft experimentation platform (http://exp-platform.com/hippo.aspx)

• KDD 2010, Google team (“Overlapping Experiment Infrastructure: More, Better, Faster Experimentation”)

Page 4: A/B Testing at SweetIM

A/B testing at SweetIM

New users: on the LP visit, a cookie assigns each user to Confirmation Page A (control; Cookie A, no change) or Confirmation Page B (new feature; Cookie B).

Existing users: assigned by activity domain via Unique ID % N. With N = 2: Unique ID % 2 = 0 → no change; Unique ID % 2 = 1 → new feature.
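The existing-user split above is simple deterministic bucketing. A minimal sketch of the idea, assuming an integer unique ID and the slide's N = 2 split (function and group names are illustrative, not SweetIM's actual code):

```python
# Deterministic bucketing by unique ID modulo N, as described on the slide.
# Names here are illustrative, not SweetIM's actual code.
def assign_group(unique_id: int, n_buckets: int = 2) -> str:
    """Map a user to a test group; the same ID always gets the same group."""
    return "new_feature" if unique_id % n_buckets == 1 else "no_change"

print(assign_group(12345))  # 12345 % 2 = 1 -> new_feature
print(assign_group(12346))  # 12346 % 2 = 0 -> no_change
```

Because assignment depends only on the ID, a user keeps the same experience across sessions without any stored state.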

Page 5: A/B Testing at SweetIM

A/B Test on Search

[Screenshots: the search page shown to Group A vs. Group B]

Does the change increase the usage of search by users (average searches per day)?

Page 6: A/B Testing at SweetIM

Results

Under the Poisson assumption the difference is significant – but is this assumption appropriate in our case?

Some well-known facts about internet traffic:

• Different users can have very different usage patterns.

• Non-human users such as automatic bots and crawlers exist.

                  A          B
# Users           17,287     17,295
# Activity        297,485    300,843
Average (μ)       17.21      17.39
StDev             45.83      44.03
% Difference      1.08%
Poisson p-value   0.00003
NB p-value        0.47300

Page 7: A/B Testing at SweetIM

Poisson assumption

• The inhomogeneous behavior of online users does not fit this distribution.

• The variance is much greater than the mean.

• It is not well suited to the data.

Page 8: A/B Testing at SweetIM

Negative binomial assumption

• Can be used as a generalization of the Poisson in overdispersed cases (Var ≫ Mean).

• Has been used before in other domains to analyze count data (genetics, traffic modeling).

• Fits the real distribution well.

Page 9: A/B Testing at SweetIM

Poisson

X ~ Pois(λ) if:

P(X = k) = e^(−λ) λ^k / k!,  k = 0, 1, 2, …

X has mean and variance both equal to the Poisson parameter λ.

Hypothesis:

H0: λ_A = λ_B
HA: λ_A < λ_B

Distribution of the difference between means:

X̄ − Ȳ ~ N(μ_a − μ_b, σ_a²/n + σ_b²/m)

The probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true:

P_pois = Φ( (X̄ − Ȳ) / √( μt (n + m) / (n m) ) )

X̄, Ȳ – means of the two groups; μ – Poisson parameter; t – test duration; n, m – sizes of the test groups.
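The Poisson p-value in the Test 1 table can be reproduced from these formulas. A sketch assuming the pooled rounded averages from the slide are used for μt (the paper's exact estimator is not stated):

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Test 1 figures from the slides
n, m = 17287, 17295
xbar, ybar = 17.21, 17.39                 # average searches per user

# Pooled estimate of mu*t (per-user mean count over the test)
mu_t = (n * xbar + m * ybar) / (n + m)

# Under Poisson the per-user variance equals the mean, so
# Var(Xbar - Ybar) = mu*t * (n + m) / (n * m)
z = (ybar - xbar) / sqrt(mu_t * (n + m) / (n * m))
p_pois = 1.0 - norm_cdf(z)                # one-sided, HA: A < B
print(f"{p_pois:.5f}")                    # ≈ 0.00003, as in the table
```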

Page 10: A/B Testing at SweetIM

Negative binomial

X ~ NB(a, p) if:

P(X = k) = C(a + k − 1, k) (1 − p)^a p^k

In the overdispersed Poisson case, λ ~ Γ(α, β) and X | λ ~ Pois(λ), which gives

X ~ NB(α, 1/(1 + β))

X has mean μ = α/β and variance σ² = μ(μ + α)/α.

Hypothesis:

H0: μ_A = μ_B
HA: μ_A < μ_B

Distribution of the difference between means:

X̄ − Ȳ ~ N(μ_a − μ_b, σ_a²/n + σ_b²/m)

The probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true:

P_NB,a = Φ( (X̄ − Ȳ) / [ μt (μt + a) / a · (n + m) / (n m) ]^(1/2) )

X̄, Ȳ – means of the two groups; μ, a – NB parameters; t – test duration; n, m – sizes of the test groups.
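Plugging the same Test 1 numbers into the NB formula makes the significance disappear. A sketch in which the dispersion parameter a comes from a rough method-of-moments fit to the observed variance (the paper's a was presumably estimated differently, e.g. by maximum likelihood, so this p-value only approximates the table's 0.473):

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Test 1 figures from the slides
n, m = 17287, 17295
xbar, ybar = 17.21, 17.39
sd = 45.8                                  # approximate pooled per-user StDev

mu_t = (n * xbar + m * ybar) / (n + m)     # pooled per-user mean count

# Method-of-moments a from sigma^2 = mu*t * (mu*t + a) / a
a = mu_t**2 / (sd**2 - mu_t)

# Var(Xbar - Ybar) under NB: (mu*t * (mu*t + a) / a) * (n + m) / (n * m)
var_nb = (mu_t * (mu_t + a) / a) * (n + m) / (n * m)
p_nb = 1.0 - norm_cdf((ybar - xbar) / sqrt(var_nb))
print(f"{p_nb:.2f}")                       # far from significant
```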

Page 11: A/B Testing at SweetIM

Results (Test 1):

The difference is not significant under the NB model.

Results of using the proper methodology:

• Saved production from an unnecessary change.

• Made it possible to benefit from additional features in the A group.

                  A          B
# Users           17,287     17,295
# Activity        297,485    300,843
Average (μ)       17.21      17.39
StDev             45.83      44.03
% Difference      1.08%
Poisson p-value   0.00003
NB p-value        0.47300

Page 12: A/B Testing at SweetIM

A/B Tests on Content

Example – a two-sided content test.

The new feature allows users who don’t have SweetIM installed on their computer to receive funny content from SweetIM users.

Results:

• Implementation of the new feature.

• Increased application usage and improved user experience.

                  A          B
# Users           17,568     17,843
# Activity        770,436    930,066
Average (μ)       43.85      52.12
StDev             74.80      96.00
% Difference      8.68%
Poisson p-value   0.00000
NB p-value        0.00000

Page 13: A/B Testing at SweetIM

A/A Tests

Required for checking:

• The randomization algorithm and distribution system.

• The technical aspects of the A/B testing system.

• The soundness of the methodology, hypotheses, and analysis.

• The existence of robots and crawlers.

Flow: an LP visit is split 50/50 into two identical Control (A) groups, with no change for either.
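An A/A test would have exposed the Poisson problem directly: with two identical groups, a correct test should reject at roughly the nominal 5% rate. A small simulation sketch (all parameters illustrative) drawing both groups from the same overdispersed negative binomial and applying a Poisson z-test:

```python
import random
from math import erf, exp, sqrt

random.seed(0)

def poisson(lam):
    # Knuth's method; adequate for the moderate rates used here
    limit, k, p = exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def nb_sample(mu, a):
    # negative binomial as a gamma-Poisson mixture: lambda ~ Gamma(a, mu/a)
    return poisson(random.gammavariate(a, mu / a))

def poisson_z_pvalue(x, y):
    # two-sided z-test that (wrongly) assumes variance = mean
    n, m = len(x), len(y)
    mu = (sum(x) + sum(y)) / (n + m)
    z = abs(sum(x) / n - sum(y) / m) / sqrt(mu * (n + m) / (n * m))
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

mu, a, n_users, reps = 17.0, 0.15, 500, 200   # illustrative values
false_pos = sum(
    poisson_z_pvalue([nb_sample(mu, a) for _ in range(n_users)],
                     [nb_sample(mu, a) for _ in range(n_users)]) < 0.05
    for _ in range(reps)
)
print(false_pos / reps)   # far above the nominal 0.05
```

This mirrors the A/A tables on the next slide, where the Poisson p-value falsely signals a difference while the NB p-value does not.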

Page 14: A/B Testing at SweetIM

A/A Tests

• A/A Test on Search

                  A          B
# Users           61,719     61,608
# Activity        1,274,279  1,288,333
Average (μ)       20.65      20.91
StDev             52.77      49.69
% Difference      1.27%
Poisson p-value   0.00000
NB p-value        0.11000

• A/A Test on Content

                  A          B
# Users           61,719     61,608
# Activity        3,142,766  3,165,208
Average (μ)       50.92      51.38
StDev             97.50      97.00
% Difference      0.98%
Poisson p-value   0.00000
NB p-value        0.18500

Page 15: A/B Testing at SweetIM

“Fraud” Detection

• It is almost impossible to filter out all non-human activity on the web.

• Automatic bots and crawlers can bias the results and lead to wrong conclusions.

• This needs to be checked in every test.
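One cheap check in this spirit is flagging users whose counts are implausibly high for a human. A toy sketch (the cap value is an arbitrary assumption for illustration, not SweetIM's actual rule):

```python
# Drop users whose activity count exceeds a cap that no plausible human
# reaches; the threshold here is purely illustrative.
def filter_bot_like(counts, cap=500):
    return [c for c in counts if c <= cap]

daily_searches = [12, 30, 7, 25000, 41]     # one bot-like outlier
print(filter_bot_like(daily_searches))      # [12, 30, 7, 41]
```

Any such filter should be applied identically to both groups, and results compared with and without it.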

Page 16: A/B Testing at SweetIM

Conclusions

• Overview of SweetIM A/B Testing system.

• Some insights on statistical aspects of A/B Testing methodology as related to count data analysis.

• Suggestion to use the negative binomial approach instead of the incorrect Poisson in the case of overdispersed count data.

• Real-world examples of A/B and A/A tests from SweetIM.

• A word about “fraud” detection.

Page 17: A/B Testing at SweetIM

A/B Test on Search Logo

[Screenshots: the search logo shown to Group A vs. Group B]

Which one was better?

Page 18: A/B Testing at SweetIM

A/B Test on Search Logo

[Screenshots: the search logo shown to Group A vs. Group B]

                  A          B
# Users           19,249     19,725
# Activity        343,136    355,929
Average (μ)       17.80      18.50
StDev             45.35      47.72
% Difference      3.93%
Poisson p-value   0.00000
NB p-value        0.00780

Page 19: A/B Testing at SweetIM

Thank You!