A/B Testing at SweetIM

A/B Testing at SweetIM – the Importance of Proper Statistical Analysis
Slava Borodovsky, SweetIM, [email protected]
Saharon Rosset, Tel Aviv University, [email protected]

Transcript of A/B Testing at SweetIM

Page 1: A/B Testing at SweetIM

A/B Testing at SweetIM – the Importance of Proper Statistical Analysis

Slava Borodovsky, SweetIM, [email protected]

Saharon Rosset, Tel Aviv University, [email protected]

Page 2: A/B Testing at SweetIM

About SweetIM

• Provides interactive content and search services for IMs and social networks.

• More than 1,000,000 monthly new users.

• More than 100,000,000 monthly search queries.

• Every new feature and product change passes A/B testing.

• The data-driven decision-making process is based on A/B testing results.

Page 3: A/B Testing at SweetIM

A/B testing – a buzzword?

• Standard A/B flow: an LP (landing page) visit is split 50/50 into Control (A), which sees no change, and Feature (B), which sees the new product.

Previous works:

• KDD 2009, Microsoft experimentation platform (http://exp-platform.com/hippo.aspx)

• KDD 2010, Google team (“Overlapping Experiment Infrastructure: More, Better, Faster Experimentation”)

Page 4: A/B Testing at SweetIM

A/B testing at SweetIM

New users: on the LP visit, a cookie assigns each user to Confirmation Page A (control; Cookie A, no change) or Confirmation Page B (new feature; Cookie B).

Existing users: assigned by activity domain via Unique ID % N. With N = 2: Unique ID % 2 = 0 → no change; Unique ID % 2 = 1 → new feature.
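The existing-user split above is simple deterministic bucketing. A minimal sketch of the idea, assuming an integer unique ID and the slide's N = 2 split (function and group names are illustrative, not SweetIM's actual code):

```python
# Deterministic bucketing by unique ID modulo N, as described on the slide.
# Names here are illustrative, not SweetIM's actual code.
def assign_group(unique_id: int, n_buckets: int = 2) -> str:
    """Map a user to a test group; the same ID always gets the same group."""
    return "new_feature" if unique_id % n_buckets == 1 else "no_change"

print(assign_group(12345))  # 12345 % 2 = 1 -> new_feature
print(assign_group(12346))  # 12346 % 2 = 0 -> no_change
```

Because assignment depends only on the ID, a user keeps the same experience across sessions without any stored state.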

Page 5: A/B Testing at SweetIM

A/B Test on Search

[Screenshots: the search page shown to Group A vs. Group B]

Does the change increase the usage of search by users (average searches per day)?

Page 6: A/B Testing at SweetIM

Results

Under the Poisson assumption the difference is significant – but is this assumption appropriate in our case?

Some well-known facts about internet traffic:

• Different users can have very different usage patterns.

• Non-human users such as automatic bots and crawlers exist.

                  A          B
# Users           17,287     17,295
# Activity        297,485    300,843
Average (μ)       17.21      17.39
StDev             45.83      44.03
% Difference      1.08%
Poisson p-value   0.00003
NB p-value        0.47300

Page 7: A/B Testing at SweetIM

Poisson assumption

• The inhomogeneous behavior of online users does not fit this distribution.

• The variance is much greater than the mean.

• It is not well suited to the data.

Page 8: A/B Testing at SweetIM

Negative binomial assumption

• Can be used as a generalization of the Poisson in overdispersed cases (Var ≫ Mean).

• Has been used before in other domains to analyze count data (genetics, traffic modeling).

• Fits the real distribution well.

Page 9: A/B Testing at SweetIM

Poisson

X ~ Pois(λ) if:

P(X = k) = e^(−λ) λ^k / k!,  k = 0, 1, 2, …

X has mean and variance both equal to the Poisson parameter λ.

Hypothesis:

H0: λ_A = λ_B
HA: λ_A < λ_B

Distribution of the difference between means:

X̄ − Ȳ ~ N(μ_a − μ_b, σ_a²/n + σ_b²/m)

The probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true:

P_pois = Φ( (X̄ − Ȳ) / √( μt (n + m) / (n m) ) )

X̄, Ȳ – means of the two groups; μ – Poisson parameter; t – test duration; n, m – sizes of the test groups.
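The Poisson p-value in the Test 1 table can be reproduced from these formulas. A sketch assuming the pooled rounded averages from the slide are used for μt (the paper's exact estimator is not stated):

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Test 1 figures from the slides
n, m = 17287, 17295
xbar, ybar = 17.21, 17.39                 # average searches per user

# Pooled estimate of mu*t (per-user mean count over the test)
mu_t = (n * xbar + m * ybar) / (n + m)

# Under Poisson the per-user variance equals the mean, so
# Var(Xbar - Ybar) = mu*t * (n + m) / (n * m)
z = (ybar - xbar) / sqrt(mu_t * (n + m) / (n * m))
p_pois = 1.0 - norm_cdf(z)                # one-sided, HA: A < B
print(f"{p_pois:.5f}")                    # ≈ 0.00003, as in the table
```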

Page 10: A/B Testing at SweetIM

Negative binomial

X ~ NB(a, p) if:

P(X = k) = C(a + k − 1, k) (1 − p)^a p^k

In the overdispersed Poisson case, λ ~ Γ(α, β) and X | λ ~ Pois(λ), which gives

X ~ NB(α, 1/(1 + β))

X has mean μ = α/β and variance σ² = μ(μ + α)/α.

Hypothesis:

H0: μ_A = μ_B
HA: μ_A < μ_B

Distribution of the difference between means:

X̄ − Ȳ ~ N(μ_a − μ_b, σ_a²/n + σ_b²/m)

The probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true:

P_NB,a = Φ( (X̄ − Ȳ) / [ μt (μt + a) / a · (n + m) / (n m) ]^(1/2) )

X̄, Ȳ – means of the two groups; μ, a – NB parameters; t – test duration; n, m – sizes of the test groups.
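Plugging the same Test 1 numbers into the NB formula makes the significance disappear. A sketch in which the dispersion parameter a comes from a rough method-of-moments fit to the observed variance (the paper's a was presumably estimated differently, e.g. by maximum likelihood, so this p-value only approximates the table's 0.473):

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Test 1 figures from the slides
n, m = 17287, 17295
xbar, ybar = 17.21, 17.39
sd = 45.8                                  # approximate pooled per-user StDev

mu_t = (n * xbar + m * ybar) / (n + m)     # pooled per-user mean count

# Method-of-moments a from sigma^2 = mu*t * (mu*t + a) / a
a = mu_t**2 / (sd**2 - mu_t)

# Var(Xbar - Ybar) under NB: (mu*t * (mu*t + a) / a) * (n + m) / (n * m)
var_nb = (mu_t * (mu_t + a) / a) * (n + m) / (n * m)
p_nb = 1.0 - norm_cdf((ybar - xbar) / sqrt(var_nb))
print(f"{p_nb:.2f}")                       # far from significant
```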

Page 11: A/B Testing at SweetIM

Results (Test 1):

The difference is not significant under the NB model.

Results of using the proper methodology:

• Saved production from an unnecessary change.

• Made it possible to benefit from additional features in the A group.

                  A          B
# Users           17,287     17,295
# Activity        297,485    300,843
Average (μ)       17.21      17.39
StDev             45.83      44.03
% Difference      1.08%
Poisson p-value   0.00003
NB p-value        0.47300

Page 12: A/B Testing at SweetIM

A/B Tests on Content

Example – a two-sided content test.

The new feature allows users who don’t have SweetIM installed on their computer to receive funny content from SweetIM users.

Results:

• Implementation of the new feature.

• Increased application usage and improved user experience.

                  A          B
# Users           17,568     17,843
# Activity        770,436    930,066
Average (μ)       43.85      52.12
StDev             74.80      96.00
% Difference      8.68%
Poisson p-value   0.00000
NB p-value        0.00000

Page 13: A/B Testing at SweetIM

A/A Tests

Required for checking:

• The randomization algorithm and distribution system.

• The technical aspects of the A/B testing system.

• The soundness of the methodology, hypotheses, and analysis.

• The existence of robots and crawlers.

Flow: an LP visit is split 50/50 into two identical Control (A) groups, with no change for either.
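An A/A test would have exposed the Poisson problem directly: with two identical groups, a correct test should reject at roughly the nominal 5% rate. A small simulation sketch (all parameters illustrative) drawing both groups from the same overdispersed negative binomial and applying a Poisson z-test:

```python
import random
from math import erf, exp, sqrt

random.seed(0)

def poisson(lam):
    # Knuth's method; adequate for the moderate rates used here
    limit, k, p = exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def nb_sample(mu, a):
    # negative binomial as a gamma-Poisson mixture: lambda ~ Gamma(a, mu/a)
    return poisson(random.gammavariate(a, mu / a))

def poisson_z_pvalue(x, y):
    # two-sided z-test that (wrongly) assumes variance = mean
    n, m = len(x), len(y)
    mu = (sum(x) + sum(y)) / (n + m)
    z = abs(sum(x) / n - sum(y) / m) / sqrt(mu * (n + m) / (n * m))
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

mu, a, n_users, reps = 17.0, 0.15, 500, 200   # illustrative values
false_pos = sum(
    poisson_z_pvalue([nb_sample(mu, a) for _ in range(n_users)],
                     [nb_sample(mu, a) for _ in range(n_users)]) < 0.05
    for _ in range(reps)
)
print(false_pos / reps)   # far above the nominal 0.05
```

This mirrors the A/A tables on the next slide, where the Poisson p-value falsely signals a difference while the NB p-value does not.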

Page 14: A/B Testing at SweetIM

A/A Tests

• A/A Test on Search

                  A          B
# Users           61,719     61,608
# Activity        1,274,279  1,288,333
Average (μ)       20.65      20.91
StDev             52.77      49.69
% Difference      1.27%
Poisson p-value   0.00000
NB p-value        0.11000

• A/A Test on Content

                  A          B
# Users           61,719     61,608
# Activity        3,142,766  3,165,208
Average (μ)       50.92      51.38
StDev             97.50      97.00
% Difference      0.98%
Poisson p-value   0.00000
NB p-value        0.18500

Page 15: A/B Testing at SweetIM

“Fraud” Detection

• It is almost impossible to filter out all non-human activity on the web.

• Automatic bots and crawlers can bias the results and lead to wrong conclusions.

• This needs to be checked in every test.
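One cheap check in this spirit is flagging users whose counts are implausibly high for a human. A toy sketch (the cap value is an arbitrary assumption for illustration, not SweetIM's actual rule):

```python
# Drop users whose activity count exceeds a cap that no plausible human
# reaches; the threshold here is purely illustrative.
def filter_bot_like(counts, cap=500):
    return [c for c in counts if c <= cap]

daily_searches = [12, 30, 7, 25000, 41]     # one bot-like outlier
print(filter_bot_like(daily_searches))      # [12, 30, 7, 41]
```

Any such filter should be applied identically to both groups, and results compared with and without it.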

Page 16: A/B Testing at SweetIM

Conclusions

• Overview of SweetIM A/B Testing system.

• Some insights on statistical aspects of A/B Testing methodology as related to count data analysis.

• Suggestion to use the negative binomial approach instead of the incorrect Poisson in the case of overdispersed count data.

• Real-world examples of A/B and A/A tests from SweetIM.

• A word about “fraud” detection.

Page 17: A/B Testing at SweetIM

A/B Test on Search Logo

[Screenshots: the search logo shown to Group A vs. Group B]

Which one was better?

Page 18: A/B Testing at SweetIM

A/B Test on Search Logo

[Screenshots: the search logo shown to Group A vs. Group B]

                  A          B
# Users           19,249     19,725
# Activity        343,136    355,929
Average (μ)       17.80      18.50
StDev             45.35      47.72
% Difference      3.93%
Poisson p-value   0.00000
NB p-value        0.00780

Page 19: A/B Testing at SweetIM

Thank You!