An Evaluation Framework for Personalization Strategy Experiment Designs
C. H. Bryan Liu
[email protected]
Imperial College London & ASOS.com, UK
Emma J. McCoy
Imperial College London, UK
ABSTRACT

Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control in group assignment. We argue that getting the right experiment setup, the allocation of users to treatment/analysis groups, should take precedence over post-hoc variance reduction techniques in order to enable the scaling of the number of experiments. We present an evaluation framework that, along with a few simple rules of thumb, allows experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstances.
KEYWORDS

Experimentation, Experiment design, Personalization strategies, A/B testing.
1 INTRODUCTION

The use of Online Controlled Experiments (OCEs, e.g. A/B tests) has become popular in measuring the impact of digital products and services, and guiding business decisions on the Web [10]. Major companies report running thousands of OCEs on any given day [6, 8, 14] and many startups exist purely to manage OCEs [2, 7].
A large number of OCEs address simple variations on elements of the user experience based on random splits, e.g. showing a different colored button to users based on a user ID hash bucket. Here, we are interested in experiments that compare personalization strategies, complex sets of targeted user interactions that are common in e-commerce and digital marketing, and measure the change to a performance metric of interest (metric hereafter). Examples of personalization strategies include the scheduling, budgeting and ordering of marketing activities directed at a user based on their purchase history.
Experiments for personalization strategies face two unique challenges. Firstly, strategies are often only applicable to a small fraction of the user base, and thus many simple experiment designs suffer from either a lack of sample size / statistical power, or diluted metric movement by including irrelevant samples [3]. Secondly, as users are not randomly assigned a priori, but must qualify to be treated with a strategy via their actions or attributes, groups of users subjected to different strategies cannot be assumed to be statistically equivalent and hence are not directly comparable.
While there are a number of variance reduction techniques (including stratification and control variates [4, 12]) that partially address the challenges, the strata and control variates involved can vary dramatically from one personalization strategy experiment to another, requiring many ad hoc adjustments. As a result, such techniques may not scale well when organizations design and run hundreds or thousands of experiments at any given time.
We argue that personalization strategy experiments should focus on the assignment of users from the strategies they qualified for to the treatment/analysis groups. We call this mapping process an experiment setup. Identifying the best experiment setup increases the chance of detecting any treatment effect. An experimentation framework can also reuse and switch between different setups quickly with little custom input, ensuring the operation can scale. More importantly, the process does not hinder the subsequent application of variance reduction techniques, meaning that we can still apply the techniques post hoc if required.
To date, many experiment setups exist to compare personalization strategies. An increasingly popular approach is to compare the strategies using multiple control groups: Quantcast calls it a dual control [1], and Facebook calls it a multi-cell lift study [9]. In the two-strategy case, this involves running two experiments on two random partitions of the user base in parallel, with each experiment further splitting the respective partition into treatment/control and measuring the incrementality (the change in a metric as compared to the case where nothing is done) of each strategy. The incrementalities of the strategies are then compared against each other.
Despite the setup above gaining traction in display advertising, there is a lack of literature on whether it is a better setup, one that has a higher sensitivity and/or apparent effect size than other setups. While [9] noted that multi-cell lift studies require a large number of users, they did not discuss how the number compares to other setups.1 The ability to identify and adopt a better experiment setup can reduce the required sample size, and hence enable more cost-effective experimentation.
We address the gap in the literature by introducing an evaluation framework that compares experiment setups given two personalization strategies. The framework is designed to be flexible so that it is able to deal with the wide range of baselines and changes in user responses presented by any pair of strategies (situations hereafter). We also recognize the need to quickly compare common setups, and provide some simple rules of thumb on situations where one setup will be better than another. In particular, we outline the conditions where employing a multi-cell setup, as well as metric dilution, is desirable.
To summarize, our contributions are: (i) We develop a flexible evaluation framework for personalization strategy experiments, where one can compare two experiment setups given the situation presented by two competing strategies (Section 2); (ii) We provide simple rules of thumb to enable experimenters who do not require the full flexibility of the framework to quickly compare common
1 A single-cell lift study is often used to measure the incrementality of a single personalization strategy, and hence is not a representative comparison.
arXiv:2007.11638v2 [stat.ME] 17 Dec 2020
AdKDD '20, Aug 2020, Online. Liu and McCoy.
[Figure 1: a Venn diagram with Group 0 (qualify for neither strategy) outside two overlapping boxes, and Groups 1 (qualify for strategy 1 only), 2 (qualify for strategy 2 only), and 3 (qualify for both) inside.]

Figure 1: Venn diagram of the user groups in our evaluation framework. The outer, left inner (red), and right inner (blue) boxes represent the entire user base, those who qualify for strategy 1, and those who qualify for strategy 2 respectively.
setups (Section 3); and (iii) We make our results useful to practitioners by making the code used in the paper (Section 4) publicly available.2
2 EVALUATION FRAMEWORK

We first present our evaluation framework for personalization strategy experiments. The experiments compare two personalization strategies, which we refer to as strategy 1 and strategy 2. Often one of them is the existing strategy, and the other is a new strategy we intend to test and learn from. In this section we introduce (i) how users qualifying themselves into strategies creates non-statistically-equivalent groups, (ii) how experimenters usually assign the users, and (iii) when we would consider an assignment to be better.
2.1 User grouping

As users qualify themselves into the two strategies, four disjoint groups emerge: those who qualify for neither strategy, those who qualify only for strategy 1, those who qualify only for strategy 2, and those who qualify for both strategies. We denote these groups (user) groups 0, 1, 2, and 3 respectively (see Figure 1). It is perhaps obvious that we cannot assume those in different user groups are statistically equivalent and compare them directly.
We assume groups 0, 1, 2, 3 have $n_0$, $n_1$, $n_2$, and $n_3$ users respectively. We also assume responses from users (which we aggregate, often by taking the mean, to obtain our metric) are distributed differently between groups, and within the same group, between the scenario where the group is subjected to the treatment associated with the corresponding strategy and where nothing is done (baseline). We list all group-scenario combinations in Table 1, and denote the mean and variance of the responses $(\mu_G, \sigma^2_G)$ for a combination $G$.3
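The grouping can be expressed compactly in code. The sketch below is our own illustration (not code from the paper's repository), mapping a user's qualification flags to the group indices of Figure 1:

```python
def user_group(qualifies_s1: bool, qualifies_s2: bool) -> int:
    """Map qualification flags to user groups 0-3 of Figure 1:
    0 = neither strategy, 1 = strategy 1 only, 2 = strategy 2 only, 3 = both."""
    return int(qualifies_s1) + 2 * int(qualifies_s2)
```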
2.2 Experiment setups

Many experiment setups exist and are in use in different organizations. Here we introduce four common setups of varying sophistication, which we also illustrate in Figure 2.
2 The code and supplementary document are available at: https://github.com/liuchbryan/experiment_design_evaluation.
3 For example, the responses for group 1 without any intervention have mean and variance $(\mu_{C_1}, \sigma^2_{C_1})$, and those for group 2 with the treatment prescribed under strategy 2 have mean and variance $(\mu_{I_2}, \sigma^2_{I_2})$.
[Figure 2: four panels overlaying analysis groups on the Venn diagram of Figure 1. Setup 1 splits the intersection into groups A and B; Setup 2 splits the entire user base into A and B; Setup 3 splits the qualified users into A and B; Setup 4 splits those qualifying for strategy 1 into A1/A2 and those qualifying for strategy 2 into B1/B2, with A = A2 - A1 and B = B2 - B1.]

Figure 2: Experiment setups overlaid on the user grouping Venn diagram in Figure 1. The hatched boxes indicate who are included in the analysis, and the downward triangles and dots indicate who are subjected to treatment prescribed under strategies 1 and 2 respectively. See Section 2.2 for a detailed description.
Setup 1 (Users in the intersection only). The setup considers only users who qualify for both strategies. The said users are randomly split (usually 50/50) into two (analysis) groups $A$ and $B$, and are prescribed the treatment specified by strategies 1 and 2 respectively. The setup is easy to implement, though it is difficult to translate any learnings obtained from the setup to other user groups (e.g. those who qualify for one strategy only) [5].
Setup 2 (All samples). The setup is a simple A/B test that considers all users, regardless of whether they qualify for any strategy. The users are randomly split into two analysis groups $A$ and $B$, and are prescribed the treatment specified by strategy 1 (2) if (i) they qualify under the strategy and (ii) they are in analysis group $A$ ($B$). This setup is the easiest to implement but usually suffers severely from a dilution in metric [3].
Setup 3 (Qualified users only). The setup is similar to Setup 2 except only those who qualified for at least one strategy ("triggered" users in some literature [3]) are included in the analysis groups. The setup sits between Setup 1 and Setup 2 in terms of user coverage, and has the advantage of capturing the largest number of useful samples while having the least metric dilution. However, the setup also prevents one from measuring the incrementality of a strategy itself; it only yields the difference in incrementalities between the two strategies.
Setup 4 (Dual control / multi-cell lift test). As described in Section 1, the setup first splits the users randomly into two randomization groups. For the first randomization group, we consider those who qualify for strategy 1, and split them into analysis groups $A_1$ and $A_2$. Group $A_2$ receives the treatment prescribed under strategy 1, and group $A_1$ acts as control. The incrementality for strategy 1 is then the difference in metric between groups $A_2$ and $A_1$. We
                                  Group 0    Group 1    Group 2    Group 3
Baseline (Control)                $C_0$      $C_1$      $C_2$      $C_3$
Under treatment (Intervention)    /          $I_1$      $I_2$      Under strategy 1: $I_a$ / Under strategy 2: $I_b$

Table 1: All group-scenario combinations in our evaluation framework for personalization strategy experiments. The columns represent the groups described in Figure 1. The baseline represents the scenario where nothing is done. We assume those who qualify for both strategies (Group 3) can only receive treatment(s) associated with either strategy.
apply the same process to the second randomization group, with strategy 2 and analysis groups $B_1$ and $B_2$ in place, and compare the incrementalities for strategies 1 and 2. The setup allows one to obtain the incrementality of each individual strategy and minimizes metric dilution. However, it also leaves a number of samples unused and creates extra analysis groups, and hence generally suffers from a low test power [9].
2.3 Evaluation criteria

There are a number of considerations when one evaluates competing experiment setups. These include technical considerations such as the complexity of implementing the setups on an OCE framework, and business considerations such as whether the incrementality of individual strategies is required.
Here we focus on the statistical aspect and propose two evaluation criteria: (i) the actual average treatment effect size as presented by the two analysis groups in an experiment setup, and (ii) the sensitivity of the experiment, represented by the minimum detectable effect (MDE) under a pre-specified test power. Both criteria are necessary as the former indicates whether a setup suffers from metric dilution, whereas the latter indicates whether the setup suffers from lack of power/sample size. An ideal setup should yield a high actual effect size and a high sensitivity (i.e. a low MDE),4 though as we observe in the next section it is usually a trade-off.
We formally define the two evaluation criteria from first principles while introducing relevant notation along the way. Let $A$ and $B$ be the two analysis groups in an experiment setup, with user responses randomly distributed with mean and variance $(\mu_A, \sigma^2_A)$ and $(\mu_B, \sigma^2_B)$ respectively. We first recall that if there are sufficient samples, the sample means of the two groups approximately follow the normal distribution by the Central Limit Theorem:

$$\bar{A} \overset{\text{approx.}}{\sim} \mathcal{N}(\mu_A, \sigma^2_A/n_A), \qquad \bar{B} \overset{\text{approx.}}{\sim} \mathcal{N}(\mu_B, \sigma^2_B/n_B), \qquad (1)$$

where $n_A$ and $n_B$ are the number of samples taken from $A$ and $B$ respectively. The difference in the sample means then also approximately follows a normal distribution:

$$\hat{D} \triangleq (\bar{B} - \bar{A}) \overset{\text{approx.}}{\sim} \mathcal{N}\left(\Delta \triangleq \mu_B - \mu_A,\; \sigma^2_{\hat{D}} \triangleq \sigma^2_A/n_A + \sigma^2_B/n_B\right). \qquad (2)$$
Here, $\Delta$ is the actual effect size that we are interested in.

The definition of the MDE $\theta^*$ requires a primer on the power of a statistical test. A common null hypothesis statistical test in personalization strategy experiments uses the two-tailed hypotheses $H_0: \Delta = 0$ and $H_1: \Delta \neq 0$, with the test statistic under $H_0$ being:

$$Z \triangleq \hat{D}/\sigma_{\hat{D}} \overset{\text{approx.}}{\sim} \mathcal{N}(0, 1). \qquad (3)$$
4 We will use the terms "high(er) sensitivity" and "low(er) MDE" interchangeably.
We recall the null hypothesis will be rejected if $|Z| > z_{1-\alpha/2}$, where $\alpha$ is the significance level and $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of a standard normal. Under a specific alternative hypothesis $\Delta = \theta$, the power is specified as

$$1 - \beta_\theta \triangleq \Pr\left(|Z| > z_{1-\alpha/2} \mid \Delta = \theta\right) \approx 1 - \Phi\left(z_{1-\alpha/2} - |\theta|/\sigma_{\hat{D}}\right), \qquad (4)$$

where $\Phi$ denotes the cumulative distribution function of a standard normal.5 To achieve a minimum test power $\pi_{\min}$, we require that $1 - \beta_\theta > \pi_{\min}$. Substituting Equation (4) into the inequality and rearranging to make $\theta$ the subject yields the effect sizes that the test will be able to detect with the specified power:

$$|\theta| > (z_{1-\alpha/2} - z_{1-\pi_{\min}})\, \sigma_{\hat{D}}. \qquad (5)$$

$\theta^*$ is then defined as the positive minimum $\theta$ that satisfies Inequality (5), i.e. that specified by the RHS of the inequality.
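As a concrete sketch (our own illustration under the notation above, not code from the paper's repository), the MDE in Inequality (5) can be computed with the Python standard library:

```python
from statistics import NormalDist

def minimum_detectable_effect(sigma_d: float, alpha: float = 0.05,
                              power: float = 0.8) -> float:
    """theta* = (z_{1-alpha/2} - z_{1-power}) * sigma_D, the RHS of
    Inequality (5), where sigma_d is the standard error of the
    difference in sample means."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    return z * sigma_d
```

At the conventional 5% significance level and 80% power, the multiplier $z_{1-\alpha/2} - z_{1-\pi_{\min}}$ is approximately 2.80.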
We finally define what it means to be better under these evaluation criteria. WLOG we assume the actual effect sizes of the two competing experiment setups are positive,6 and say a setup $X$ is superior to another setup $Y$ if, all else being equal,

(i) $X$ produces a higher actual effect size ($\Delta_X > \Delta_Y$) and a lower minimum detectable effect size ($\theta^*_X < \theta^*_Y$), or
(ii) the gain in actual effect is greater than the loss in sensitivity:

$$\Delta_X - \Delta_Y > \theta^*_X - \theta^*_Y, \qquad (6)$$

which means an actual effect still stands a higher chance of being observed under $X$.
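The two criteria translate directly into a small helper; this is our own sketch of the definition, not code from the paper's repository:

```python
def is_superior(delta_x: float, theta_x: float,
                delta_y: float, theta_y: float) -> bool:
    """True if setup X is superior to setup Y under Section 2.3's criteria,
    assuming both actual effect sizes are positive."""
    criterion_i = delta_x > delta_y and theta_x < theta_y
    criterion_ii = delta_x - delta_y > theta_x - theta_y   # Inequality (6)
    return criterion_i or criterion_ii
```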
3 COMPARING EXPERIMENT SETUPS

Having described the evaluation framework above, in this section we use the framework to compare the common experiment setups described in Section 2.2. We first derive the actual effect size and MDE for each setup in Section 3.1, and use the results to create rules of thumb on (i) whether diluting the metric by including users who qualify for neither strategy is beneficial (Section 3.2) and (ii) whether dual control is a better setup for personalization strategy experiments (Section 3.3), two questions that are often discussed among e-commerce and marketing-focused experimenters. For brevity, we relegate most of the intermediate algebraic work when deriving the actual & minimum detectable effect sizes, as well as the conditions that lead to a setup being superior, to our supplementary document.2
5 The approximation in Equation (4) is tight for experiment design purposes, where $\alpha < 0.2$ and $1 - \beta > 0.6$ in nearly all cases.
6 If both actual effect sizes are negative, we simply swap the analysis groups. If the actual effect sizes are of opposite signs, it is likely an error.
3.1 Actual & minimum detectable effect sizes

We first present the actual effect size and MDE of the four experiment setups. For each setup we first compute the sample size, mean response, and response variance in each analysis group, which arises as a mixture of the user groups described in Section 2.1. For brevity, we only present the quantities for one analysis group per setup; expressions for other analysis group(s) can be easily obtained by substituting in the corresponding user groups. We then substitute the quantities computed into the definitions of $\Delta$ (see Equation (2)) and $\theta^*$ (see Inequality (5)) to obtain the setup-specific actual effect size and MDE. We assume all random splits are done 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). The setup randomly splits user group 3 into two analysis groups, each with $n_3/2$ samples. Users in analysis group $A$ are provided treatment under strategy 1, and hence the group's responses have a mean and variance of $(\mu_{I_a}, \sigma^2_{I_a})$. The actual effect size and MDE for Setup 1 are hence:

$$\Delta_{S1} = \mu_{I_b} - \mu_{I_a}, \qquad (7)$$

$$\theta^*_{S1} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{2(\sigma^2_{I_a} + \sigma^2_{I_b})/n_3}. \qquad (8)$$
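Equations (7) and (8) can be sketched as follows (our own illustration; the function and variable names are ours):

```python
from math import sqrt
from statistics import NormalDist

def setup1_effect_and_mde(mu_ia, mu_ib, var_ia, var_ib, n3,
                          alpha=0.05, power=0.8):
    """Actual effect size (Eq. 7) and MDE (Eq. 8) for Setup 1, where
    group 3's n3 users are split 50/50 between the two strategies."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    delta = mu_ib - mu_ia                          # Eq. (7)
    theta = z * sqrt(2 * (var_ia + var_ib) / n3)   # Eq. (8)
    return delta, theta
```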
Setup 2 (All samples). This setup also contains two analysis groups, $A$ and $B$, each taking half of the population (i.e. $(n_0+n_1+n_2+n_3)/2$ users). The mean response and response variance for groups $A$ and $B$ are the weighted mean response and response variance of the constituent user groups respectively. As we only provide treatment to those who qualify for strategy 1 in group $A$, and likewise for group $B$ with strategy 2, each user group will give different responses, e.g. for group $A$:

$$\mu_A = (n_0\mu_{C_0} + n_1\mu_{I_1} + n_2\mu_{C_2} + n_3\mu_{I_a})/(n_0+n_1+n_2+n_3), \qquad (9)$$

$$\sigma^2_A = (n_0\sigma^2_{C_0} + n_1\sigma^2_{I_1} + n_2\sigma^2_{C_2} + n_3\sigma^2_{I_a})/(n_0+n_1+n_2+n_3). \qquad (10)$$

Substituting the above (and that for group $B$) into the definitions of actual effect size and MDE we have:

$$\Delta_{S2} = \frac{n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_b} - \mu_{I_a})}{n_0+n_1+n_2+n_3}, \qquad (11)$$

$$\theta^*_{S2} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{2\left(n_0(2\sigma^2_{C_0}) + n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_a} + \sigma^2_{I_b})\right)}{(n_0+n_1+n_2+n_3)^2}}. \qquad (12)$$
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded. This leads to both analysis groups having $(n_1+n_2+n_3)/2$ users. The absence of group 0 users means they are not featured in the weighted mean response and response variance of the two analysis groups, e.g. for group $A$:

$$\mu_A = \frac{n_1\mu_{I_1} + n_2\mu_{C_2} + n_3\mu_{I_a}}{n_1+n_2+n_3}, \qquad \sigma^2_A = \frac{n_1\sigma^2_{I_1} + n_2\sigma^2_{C_2} + n_3\sigma^2_{I_a}}{n_1+n_2+n_3}. \qquad (13)$$

This leads to the following actual effect size and MDE for Setup 3:

$$\Delta_{S3} = \frac{n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_b} - \mu_{I_a})}{n_1+n_2+n_3}, \qquad (14)$$

$$\theta^*_{S3} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{2\left(n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_a} + \sigma^2_{I_b})\right)}{(n_1+n_2+n_3)^2}}. \qquad (15)$$
Setup 4 (Dual control). The setup is the odd one out as it has four analysis groups. Two of the analysis groups ($A_1$ and $A_2$) are drawn from those who qualify for strategy 1 and are allocated to the first randomization group, and the other two ($B_1$ and $B_2$) are drawn from those who qualify for strategy 2 and are allocated to the second randomization group:

$$n_{A_1} = n_{A_2} = (n_1+n_3)/4, \qquad n_{B_1} = n_{B_2} = (n_2+n_3)/4. \qquad (16)$$

The mean response and response variance for group $A_1$ are:

$$\mu_{A_1} = \frac{n_1\mu_{C_1} + n_3\mu_{C_3}}{n_1+n_3}, \qquad \sigma^2_{A_1} = \frac{n_1\sigma^2_{C_1} + n_3\sigma^2_{C_3}}{n_1+n_3}. \qquad (17)$$

As the setup takes the difference of differences in the metric (i.e. the difference between the mean response for groups $B_2$ and $B_1$, and the difference between the mean response for groups $A_2$ and $A_1$),7 the actual effect size is as follows:

$$\Delta_{S4} = (\mu_{B_2} - \mu_{B_1}) - (\mu_{A_2} - \mu_{A_1}) = \frac{n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_b} - \mu_{C_3})}{n_2+n_3} - \frac{n_1(\mu_{I_1} - \mu_{C_1}) + n_3(\mu_{I_a} - \mu_{C_3})}{n_1+n_3}. \qquad (18)$$

The MDE for Setup 4 is similar to that specified in the RHS of Inequality (5), albeit with more groups:

$$\theta^*_{S4} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\sigma^2_{A_1}/n_{A_1} + \sigma^2_{A_2}/n_{A_2} + \sigma^2_{B_1}/n_{B_1} + \sigma^2_{B_2}/n_{B_2}}$$
$$= 2\,(z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{n_1(\sigma^2_{C_1}+\sigma^2_{I_1}) + n_3(\sigma^2_{C_3}+\sigma^2_{I_a})}{(n_1+n_3)^2} + \frac{n_2(\sigma^2_{C_2}+\sigma^2_{I_2}) + n_3(\sigma^2_{C_3}+\sigma^2_{I_b})}{(n_2+n_3)^2}}. \qquad (19)$$
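A matching sketch for Equations (18) and (19) (again our own illustration, with assumed variable names):

```python
from math import sqrt
from statistics import NormalDist

def setup4_effect_and_mde(n, mu, var, alpha=0.05, power=0.8):
    """Actual effect size (Eq. 18) and MDE (Eq. 19) for Setup 4 (dual control).

    n:   tuple (n1, n2, n3)
    mu:  dict of group-scenario means, keys 'C1','I1','C2','I2','C3','Ia','Ib'
    var: dict of group-scenario variances with the same keys
    """
    n1, n2, n3 = n
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    # Eq. (18): incrementality of strategy 2 minus incrementality of strategy 1
    delta = ((n2 * (mu['I2'] - mu['C2']) + n3 * (mu['Ib'] - mu['C3'])) / (n2 + n3)
             - (n1 * (mu['I1'] - mu['C1']) + n3 * (mu['Ia'] - mu['C3'])) / (n1 + n3))
    # Eq. (19): four analysis groups, each holding a quarter of its partition
    theta = 2 * z * sqrt(
        (n1 * (var['C1'] + var['I1']) + n3 * (var['C3'] + var['Ia'])) / (n1 + n3) ** 2
        + (n2 * (var['C2'] + var['I2']) + n3 * (var['C3'] + var['Ib'])) / (n2 + n3) ** 2)
    return delta, theta
```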
3.2 Is dilution always bad?

The use of responses from users who do not qualify for any of the strategies we are comparing, an act known as metric dilution, has stirred countless debates in experimentation teams. On one hand, responses from these users make any treatment effect less pronounced by contributing exactly zero; on the other hand, it might be necessary as one does not know who actually qualifies for a strategy [9], or it might be desirable as these users can be leveraged to reduce the variance of the treatment effect estimator [3].
Here, we are interested in whether we should engage in the act of dilution given the assumed user responses prior to an experiment. This can be clarified by understanding the conditions under which Setup 3 would emerge superior (as defined in Section 2.3) to Setup 2. By inspecting Equations (11) and (14), it is clear that $\Delta_{S3} > \Delta_{S2}$
7 Not to be confused with the difference-in-differences method, which captures the metric for the control and treatment groups at both the beginning (pre-intervention) and the end (post-intervention) of the experiment. Here we only capture the metric once at the end of the experiment.
if $n_0 > 0$. Thus, Setup 3 is superior to Setup 2 under the first criterion if $\theta^*_{S3} < \theta^*_{S2}$, which is the case if $\sigma^2_{C_0}$, the metric variance of users who qualify for neither strategy, is large. This can be shown by substituting Equations (12) and (15) into the $\theta^*$-inequality and rearranging the terms to obtain:

$$\left(n_1(\sigma^2_{I_1} + \sigma^2_{C_1}) + n_2(\sigma^2_{C_2} + \sigma^2_{I_2}) + n_3(\sigma^2_{I_a} + \sigma^2_{I_b})\right) \cdot \frac{n_0 + 2n_1 + 2n_2 + 2n_3}{2(n_1 + n_2 + n_3)^2} < \sigma^2_{C_0}. \qquad (20)$$
If we assume the response variances are similar across groups of users who qualified for at least one strategy, i.e. $\sigma^2_{I_1} \approx \sigma^2_{C_1} \approx \cdots \approx \sigma^2_{I_b} \approx \sigma^2_q$, Inequality (20) can then be simplified as

$$\sigma^2_q \left(\frac{n_0}{n_1 + n_2 + n_3} + 2\right) < \sigma^2_{C_0}, \qquad (21)$$

which can be used to quickly determine whether one should consider dilution at all.
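Under the equal-variance assumption, the rule of thumb in Inequality (21) reduces to a one-line check (our own sketch; the function name is ours):

```python
def dilution_undesirable(var_q, var_c0, n0, n_qualified):
    """Check Inequality (21): if it holds, the undiluted Setup 3 has a lower
    MDE than Setup 2, so adding group 0 users (dilution) is undesirable.

    var_q:       common response variance of the qualified groups (sigma^2_q)
    var_c0:      baseline variance of group 0 users (sigma^2_{C0})
    n0:          number of users qualifying for neither strategy
    n_qualified: n1 + n2 + n3
    """
    return var_q * (n0 / n_qualified + 2) < var_c0
```

In particular, since the LHS is at least $2\sigma^2_q$, group 0 users need a variance of more than twice that of the qualified groups before dilution is clearly undesirable on the MDE criterion.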
If Inequality (20) does not hold (i.e. $\theta^*_{S3} \geq \theta^*_{S2}$), we should then consider when the second criterion (i.e. $\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}$) is met. Writing

$$\eta = n_1(\mu_{C_1} - \mu_{I_1}) + n_2(\mu_{I_2} - \mu_{C_2}) + n_3(\mu_{I_b} - \mu_{I_a}),$$
$$\xi = n_1(\sigma^2_{C_1} + \sigma^2_{I_1}) + n_2(\sigma^2_{I_2} + \sigma^2_{C_2}) + n_3(\sigma^2_{I_a} + \sigma^2_{I_b}), \text{ and}$$
$$z = z_{1-\alpha/2} - z_{1-\pi_{\min}},$$

we can substitute Equations (11), (12), (14), and (15) into the inequality and rearrange to obtain

$$\frac{n_1 + n_2 + n_3}{n_0} \sqrt{2 n_0 \sigma^2_{C_0} + \xi} > \frac{n_0 + n_1 + n_2 + n_3}{n_0} \sqrt{\xi} - \frac{\eta}{\sqrt{2}\, z}. \qquad (22)$$
As the LHS of Inequality (22) is always positive, Setup 3 is superior if the RHS $\leq 0$. Noting

$$\Delta_{S3} = \eta/(n_1 + n_2 + n_3) \quad \text{and} \quad \theta^*_{S3} = \sqrt{2} \cdot z \cdot \sqrt{\xi}/(n_1 + n_2 + n_3),$$

the trivial case is satisfied if

$$\frac{n_0 + n_1 + n_2 + n_3}{n_0} \cdot \theta^*_{S3} \leq \Delta_{S3}. \qquad (23)$$
If the RHS of Inequality (22) is positive, we can safely square both sides and use the identities for $\Delta_{S3}$ and $\theta^*_{S3}$ to get

$$\frac{2\sigma^2_{C_0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1+n_2+n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}. \qquad (24)$$

As the LHS is always positive, the second criterion is met if

$$\theta^*_{S3} \leq \Delta_{S3}. \qquad (25)$$
Note this is a weaker, and thus more easily satisfiable, condition than that introduced in Inequality (23). This suggests an experiment setup is always superior to a diluted alternative if the experiment is already adequately powered; introducing any dilution will simply make things worse.
Failing the condition in Inequality (25), we can always fall back to Inequality (24). While the inequality operates in squared space, it essentially compares the standard error of user group 0 (LHS), those who qualify for neither strategy, to the gap between the minimum detectable and actual effects ($\theta^*_{S3} - \Delta_{S3}$). The gap can be interpreted as the existing noise level; thus a higher standard error means mixing in group 0 users will introduce extra noise, and one is better off without them. Conversely, a smaller standard error means group 0 users can lower the noise level, i.e. stabilize the metric fluctuation, and one should take advantage of them.

To summarize, diluting a personalization strategy experiment setup is not helpful if:

(i) Users who do not qualify for any strategy have a large metric variance (Inequality (20)), or
(ii) The experiment is already adequately powered (Inequality (25)).

It could help if the experiment has not yet gained sufficient power and users who do not qualify for any strategy provide low-variance responses, such that they exhibit stabilizing effects when included in the analysis (complement of Inequality (24)).
3.3 When is a dual control more effective?

Often when advertisers compare two personalization strategies, the question of whether to use a dual control/multi-cell design comes up. Proponents of such an approach celebrate its ability to tell a story by making the incrementality of an individual strategy available, while opponents voice concerns about the complexity of setting up the design. Here we are interested in whether Setup 4 (dual control) is superior to Setup 3 (a simple A/B test) from a power/detectable effect perspective, and if so, under what circumstances.
We first observe that $\theta^*_{S4} > \theta^*_{S3}$ is always true, and hence a dual control setup will never be superior to a simpler setup under the first criterion. This can be verified by substituting in Equations (19) and (15) and rearranging the terms to show the inequality is equivalent to

$$2\left(\frac{n_1}{(n_1+n_3)^2}(\sigma^2_{C_1} + \sigma^2_{I_1}) + \frac{n_2}{(n_2+n_3)^2}(\sigma^2_{C_2} + \sigma^2_{I_2}) + \frac{n_3}{(n_1+n_3)^2}\sigma^2_{I_a} + \frac{n_3}{(n_2+n_3)^2}\sigma^2_{I_b} + \left(\frac{n_3}{(n_1+n_3)^2} + \frac{n_3}{(n_2+n_3)^2}\right)\sigma^2_{C_3}\right)$$
$$> \frac{n_1}{(n_1+n_2+n_3)^2}(\sigma^2_{C_1} + \sigma^2_{I_1}) + \frac{n_2}{(n_1+n_2+n_3)^2}(\sigma^2_{C_2} + \sigma^2_{I_2}) + \frac{n_3}{(n_1+n_2+n_3)^2}\sigma^2_{I_a} + \frac{n_3}{(n_1+n_2+n_3)^2}\sigma^2_{I_b}, \qquad (26)$$

which is always true given the $n$s are non-negative and the $\sigma^2$s are positive: not only are the coefficients of the $\sigma^2$-terms larger on the LHS than their RHS counterparts, the LHS also carries an extra $\sigma^2_{C_3}$ term with non-negative coefficient and a factor of two.
Moving on to the second evaluation criterion, we recall that Setup 4 is superior if $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$; otherwise Setup 3 is superior under the same criterion. The full flexibility of the model can be seen by substituting Equations (14), (15), (18), and (19) into the inequality and rearranging to obtain

$$\frac{n_1 \frac{n_2(\mu_{I_2}-\mu_{C_2}) + n_3(\mu_{I_b}-\mu_{C_3})}{n_2+n_3} - n_2 \frac{n_1(\mu_{I_1}-\mu_{C_1}) + n_3(\mu_{I_a}-\mu_{C_3})}{n_1+n_3}}{\sqrt{n_1(\sigma^2_{C_1}+\sigma^2_{I_1}) + n_2(\sigma^2_{C_2}+\sigma^2_{I_2}) + n_3(\sigma^2_{I_a}+\sigma^2_{I_b})}} > \sqrt{2}\,z \left[\sqrt{\frac{2\left(\left(1+\frac{n_2}{n_1+n_3}\right)^2 \left[n_1(\sigma^2_{C_1}+\sigma^2_{I_1}) + n_3(\sigma^2_{C_3}+\sigma^2_{I_a})\right] + \left(1+\frac{n_1}{n_2+n_3}\right)^2 \left[n_2(\sigma^2_{C_2}+\sigma^2_{I_2}) + n_3(\sigma^2_{C_3}+\sigma^2_{I_b})\right]\right)}{n_1(\sigma^2_{C_1}+\sigma^2_{I_1}) + n_2(\sigma^2_{C_2}+\sigma^2_{I_2}) + n_3(\sigma^2_{I_a}+\sigma^2_{I_b})}} - 1\right], \qquad (27)$$

where $z = z_{1-\alpha/2} - z_{1-\pi_{\min}}$.
A key observation from inspecting Inequality (27) is that the LHS of the inequality scales in $O(\sqrt{n})$, where $n$ is the number of users, while the RHS remains a constant. This leads to the insight
that Setup 4 is more likely to be superior if the $n$s are large. Here we assume the ratio $n_1 : n_2 : n_3$ remains unchanged when we scale the number of samples, an assumption that generally holds when an organization increases its reach while maintaining its user mix. It is worth pointing out that our claim is stronger than that in previous work: we have shown that having a large user base not only fulfills the requirement of running a dual control experiment as described in [9], it also makes a dual control experiment a better setup than its simpler counterparts in terms of apparent and detectable effect sizes.
The scaling relationship can be seen more clearly if we apply some simplifying assumptions to the $\sigma^2$- and $n$-terms. If we assume the response variances are similar across user groups (i.e. $\sigma^2_{C_1} \approx \sigma^2_{I_1} \approx \cdots \approx \sigma^2_{I_b} \approx \sigma^2_q$), the RHS of Inequality (27) becomes

$$\sqrt{2}\,z \left[\sqrt{\frac{2(n_1+n_2+n_3)}{n_1+n_3} + \frac{2(n_1+n_2+n_3)}{n_2+n_3}} - 1\right], \qquad (28)$$

which remains a constant if the ratio $n_1 : n_2 : n_3$ remains unchanged. If we assume the numbers of users in groups 1, 2, 3 are similar (i.e. $n_1 = n_2 = n_3 = n$), the LHS of Inequality (27) becomes

$$\frac{\sqrt{n}\left((\mu_{I_2}-\mu_{C_2}) - (\mu_{I_1}-\mu_{C_1}) + \mu_{I_b} - \mu_{I_a}\right)}{2\sqrt{\sigma^2_{C_1} + \sigma^2_{I_1} + \sigma^2_{C_2} + \sigma^2_{I_2} + \sigma^2_{I_a} + \sigma^2_{I_b}}}, \qquad (29)$$

which clearly scales in $O(\sqrt{n})$.
We conclude the section by providing an indication of what a large $n$ may look like. If we assume both the response variances and the numbers of users are similar across user groups, we can rearrange Inequality (27) to make $n$ the subject:

$$n > \left(2\sqrt{12}\left(\sqrt{6} - 1\right) z\right)^2 \frac{\sigma^2_q}{\Delta^2}, \qquad (30)$$

where $\Delta = (\mu_{I_2}-\mu_{C_2}) - (\mu_{I_1}-\mu_{C_1}) + \mu_{I_b} - \mu_{I_a}$ is the difference in actual effect sizes between Setups 4 and 3. Under a 5% significance level and 80% power, the first coefficient amounts to around 791, which is roughly 50 times the coefficient one would use to determine the sample size of a simple A/B test [11]. This suggests a dual control setup is perhaps a luxury accessible only to the largest advertising platforms and their top advertisers. For example, consider an experiment to optimize conversion rate where the baselines attain 20% (hence having a variance of $0.2(1-0.2) = 0.16$). If there is a 2.5% relative (i.e. 0.5% absolute) effect between the competing strategies, the dual control setup will only be superior if there are > 5M users in each user group.
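The back-of-envelope calculation in Inequality (30) can be reproduced as follows (our own sketch under the stated equal-variance, equal-size assumptions):

```python
from math import sqrt
from statistics import NormalDist

def dual_control_min_n(var_q, delta, alpha=0.05, power=0.8):
    """Minimum users per user group (n1 = n2 = n3 = n) for Setup 4 to beat
    Setup 3 (Inequality (30)), under similar variances and group sizes."""
    z = NormalDist().inv_cdf(1 - alpha / 2) - NormalDist().inv_cdf(1 - power)
    coefficient = (2 * sqrt(12) * (sqrt(6) - 1) * z) ** 2  # ~791 at 5% / 80%
    return coefficient * var_q / delta ** 2

# Worked example from the text: 20% baseline conversion rate
# (variance 0.16) and a 0.5% absolute effect difference.
n_required = dual_control_min_n(0.16, 0.005)  # just over 5 million users
```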
4 EXPERIMENTS

Having performed theoretical calculations for the actual and detectable effects and the conditions where an experiment setup is superior to another, here we verify those calculations using simulation results. We focus on the results presented in Section 3.1, as the rest of the results presented follow from those calculations.
In each experiment setup evaluation, we randomly select the values of the parameters (i.e. the $\mu$s, $\sigma^2$s, and $n$s), and take 1,000 actual effect samples, each by (i) sampling the responses from the user groups under the specified parameters, (ii) computing the mean for the analysis groups, and (iii) taking the difference of the means.
            Actual effect size     Minimum detectable effect
Setup 1     1049/1099 (95.45%)     66/81 (81.48%)
Setup 2     853/999 (85.38%)       87/106 (82.08%)
Setup 3     922/1099 (83.89%)      93/116 (80.18%)
Setup 4     240/333 (72.07%)       149/185 (80.54%)

Table 2: Number of evaluations where the theoretical value of the quantities (columns) falls within the 95% bootstrap confidence interval for each experiment setup (rows). See Section 4 for a detailed description of the evaluations.
We also take 100 MDE samples in separate evaluations, each by (i) sampling a critical value under the null hypothesis; (ii) computing the test power under a large number of possible effect sizes, each using the critical value and sampled metric means under the alternative hypothesis; and (iii) searching the effect size space for the value that gives the predefined power. As the power vs. effect size curve is noisy given the use of simulated power samples, we use the bisection algorithm provided by the noisyopt package to perform the search. The algorithm dynamically adjusts the number of samples taken from the same point on the curve to ensure the noise does not send us down the wrong search space.
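A stripped-down sketch of steps (ii)-(iii): the version below replaces noisyopt's noise-aware bisection with a plain bisection that averages a fixed number of simulations per point, which is enough to illustrate the search (function names and constants are ours, not the paper's):

```python
import random
from statistics import NormalDist

def simulated_power(theta, sigma_d, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated tests where |Z| exceeds the critical value,
    given a true effect theta and standard error sigma_d."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return sum(abs(rng.gauss(theta / sigma_d, 1)) > crit
               for _ in range(n_sims)) / n_sims

def mde_by_bisection(sigma_d, target_power=0.8, lo=0.0, hi=1.0, iters=20):
    """Search the effect-size space for the effect giving the target power."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if simulated_power(mid, sigma_d) < target_power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

With sigma_d = 0.01 the search lands near the theoretical MDE of about 0.028 given by Inequality (5), up to Monte Carlo noise.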
We expect the mean of the sampled actual effects and MDEs to match the theoretical values. To verify this, we perform 1,000 bootstrap resamplings on the samples obtained above to obtain an empirical bootstrap distribution of the sample mean in each evaluation. The 95% bootstrap resampling confidence interval (BRCI) should then contain the theoretical mean 95% of the time. The histogram of the percentile rank of the theoretical quantity in relation to the bootstrap samples across multiple evaluations should also follow a uniform distribution [13].
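One such evaluation can be sketched as follows; the Gaussian actual-effect samples and the theoretical mean of 0.03 are illustrative stand-ins, not values from the paper:

```python
import random
from statistics import mean

rng = random.Random(42)
theoretical_mean = 0.03

# stand-in for the 1,000 actual-effect samples from one evaluation
samples = [rng.gauss(theoretical_mean, 0.02) for _ in range(1000)]

# 1,000 bootstrap resamplings of the sample mean
boot_means = sorted(
    mean(rng.choices(samples, k=len(samples))) for _ in range(1000)
)

lo, hi = boot_means[24], boot_means[974]   # ~95% BRCI (2.5th / 97.5th percentiles)
# percentile rank of the theoretical value within the bootstrap distribution
rank = sum(m < theoretical_mean for m in boot_means) / len(boot_means)
```

Repeating this over many evaluations, the BRCI `(lo, hi)` should cover `theoretical_mean` about 95% of the time and the `rank` values should look uniform on [0, 1].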
The results are shown in Table 2. One can observe that more evaluations have their theoretical quantity lying outside the BRCI than expected. Upon further investigation, we observed a characteristic ∪-shape in the histograms of the percentile ranks for the actual effects. This suggests the bootstrap samples may be under-dispersed but otherwise centered on the theoretical quantities.
We also observed the histograms for MDEs curving upward to the right, which suggests that the theoretical value is a slight overestimate (of < 1% relative to the bootstrap mean in all cases). We believe this is likely due to a small bias in the bisection algorithm. The algorithm tests whether the mean of the power samples is less than the target power to decide which half of the search space to continue along. Given we can bisect up to 10 times in each evaluation, a false positive is likely even when we set the significance level for individual comparisons to 1%. This leads the algorithm to favor a smaller MDE sample. That said, since we have tested a wide range of parameters and the overall bias is small, we are satisfied with the theoretical quantities for experiment design purposes.
5 CONCLUSION
We have addressed the problem of comparing experiment designs for personalization strategies by presenting an evaluation framework that allows experimenters to evaluate which experiment setup
An Evaluation Framework for Personalization Strategy Experiment Designs AdKDD β20, Aug 2020, Online
should be adopted given the situation. The flexible framework can easily be extended to compare setups that involve more than two strategies by adding more user groups (i.e. new sets to the Venn diagram in Figure 1). A new setup can also be incorporated quickly, as it is essentially a different weighting of the user group-scenario combinations shown in Table 1. The framework also allows the development of simple rules of thumb such as:
(i) Metric dilution should never be employed if the experiment already has sufficient power, though it can be useful if the experiment is under-powered and the non-qualifying users provide a "stabilizing effect"; and
(ii) A dual control setup is superior to simpler setups only if one has access to the user base of the largest organizations.
We have validated the theoretical results via simulations, and made the code available² so that practitioners can benefit from the results immediately when designing their upcoming experiments.
Future Work. So far we have assumed the responses from each user group-scenario combination are randomly distributed with the mean independent of the variance, with the evaluation criteria calculated assuming the metric (being the weighted sample mean of the responses) is approximately normally distributed by the Central Limit Theorem. We are interested in whether the evaluation framework and its results still hold if we deviate from these assumptions, e.g. with binary responses (where the mean and variance are correlated) and heavy-tailed responses (where the sample mean converges to a normal distribution slowly).
ACKNOWLEDGMENTS
The work is partially funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at Imperial and Oxford (StatML.IO) and ASOS.com. The authors thank the anonymous reviewers for providing many improvements to the original manuscript.
REFERENCES
[1] Brooke Bengier and Amanda Knupp. [n.d.]. Selling More Stuff: The What, Why & How of Incrementality Testing. https://www.quantcast.com/blog/selling-more-stuff-the-what-why-how-of-incrementality-testing/. Blog post.
[2] Will Browne and Mike Swarbrick Jones. 2017. What Works in e-commerce - a Meta-analysis of 6700 Online Experiments. http://www.qubit.com/wp-content/uploads/2017/12/qubit-research-meta-analysis.pdf. White paper.
[3] Alex Deng and Victor Hu. 2015. Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (Shanghai, China) (WSDM '15). ACM, New York, NY, USA, 349-358. https://doi.org/10.1145/2684822.2685307
[4] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-experiment Data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (Rome, Italy) (WSDM '13). ACM, New York, NY, USA, 123-132. https://doi.org/10.1145/2433396.2433413
[5] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1427-1436.
[6] Henning Hohnhold, Deirdre O'Brien, and Diane Tang. 2015. Focusing on the Long-term: It's Good for Users and Business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 1849-1858.
[7] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B Tests: Why It Matters, and What to Do About It. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1517-1525.
[8] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, Illinois, USA) (KDD '13). ACM, New York, NY, USA, 1168-1176.
[9] C. H. Bryan Liu, Elaine M. Bettaney, and Benjamin Paul Chamberlain. 2018. Designing Experiments to Measure Incrementality on Facebook. arXiv:1806.02588 [stat.ME]. 2018 AdKDD & TargetAd Workshop.
[10] C. H. Bryan Liu, Benjamin Paul Chamberlain, and Emma J. McCoy. 2020. What is the Value of Experimentation and Measurement?: Quantifying the Value and Risk of Reducing Uncertainty to Make Better Decisions. Data Science and Engineering 5, 2 (2020), 152-167. https://doi.org/10.1007/s41019-020-00121-5
[11] Evan Miller. 2010. How Not To Run an A/B Test. https://www.evanmiller.org/how-not-to-run-an-ab-test.html. Blog post.
[12] Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). ACM, New York, NY, USA, 235-244. https://doi.org/10.1145/2939672.2939688
[13] Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv:1804.06788 [stat.ME]
[14] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 2227-2236.
SUPPLEMENTARY DOCUMENT
In this supplementary document, we revisit Section 3 of the paper "An Evaluation Framework for Personalization Strategy Experiment Designs" by Liu and McCoy (2020), and expand on the intermediate algebraic work when deriving:
(i) The actual & minimum detectable effect sizes, and
(ii) The conditions that lead to a setup being superior.
We will use the equation numbers featured in the original paper, and append letters for intermediate steps.
A EFFECT SIZE OF EXPERIMENT SETUPS
We begin with the actual effect size and the MDE of the four experiment setups that are featured in Section 3.1 of the paper. For each setup we first compute the sample size, mean response, and response variance in each analysis group (denoted $n_g$, $\mu_g$, and $\sigma^2_g$ respectively for each analysis group $g$). These quantities arise as a mixture of the user groups described in Section 2.1 of the paper. We then substitute the quantities computed into the definitions of $\Delta$ and $\theta^*$:
\[
\Delta \triangleq \mu_B - \mu_A, \tag{see (2)}
\]
\[
\theta^* \triangleq (z_{1-\alpha/2} - z_{1-\pi_{\min}})\,\sigma_{\hat{\Delta}}, \quad \text{where } \sigma_{\hat{\Delta}} = \sqrt{\sigma^2_A/n_A + \sigma^2_B/n_B}, \tag{see (5)}
\]
and $z_q$ represents the $q$th quantile of the standard normal distribution, to obtain the setup-specific actual effect size and MDE for a setup with two analysis groups. For setups with more than two analysis groups, we will specify the actual and minimum detectable effect when we discuss the specifics of each setup. We assume all random splits are done 50/50 in these setups to maximize the test power.
Setup 1 (Users in the intersection only). We recall the setup, which considers only users who qualify for both personalization strategies (i.e. the intersection), randomly splits user group 3 into two analysis groups, $A$ and $B$, each with the following number of samples:
\[
n_A = n_B = \frac{n_3}{2}.
\]
Users in analysis group $A$ are provided treatment under strategy 1, and users in analysis group $B$ are provided treatment under strategy 2. This leads to the groups' responses having the following metric mean and variance:
\[
\mu_A = \mu_{I\cap 1}, \quad \mu_B = \mu_{I\cap 2}, \quad \sigma^2_A = \sigma^2_{I\cap 1}, \quad \sigma^2_B = \sigma^2_{I\cap 2}.
\]
The actual effect size and MDE for Setup 1 are hence:
\[
\Delta_{S1} = \mu_{I\cap 2} - \mu_{I\cap 1}, \tag{7}
\]
\[
\theta^*_{S1} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{\sigma^2_{I\cap 1}}{n_3/2} + \frac{\sigma^2_{I\cap 2}}{n_3/2}}. \tag{8}
\]
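A minimal sketch of Equations (7) and (8), under hypothetical parameter values for the intersection group ($z$ is computed for a 5% significance level and 80% power):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(1 - 0.05 / 2) - NormalDist().inv_cdf(1 - 0.80)

# hypothetical intersection-group parameters
n3 = 10_000
mu_icap1, mu_icap2 = 0.50, 0.53    # mean responses under strategies 1 and 2
var_icap1, var_icap2 = 1.0, 1.21   # response variances

delta_s1 = mu_icap2 - mu_icap1                                      # Equation (7)
theta_s1 = z * (var_icap1 / (n3 / 2) + var_icap2 / (n3 / 2)) ** 0.5  # Equation (8)
```

Here the MDE (roughly 0.059) is about twice the actual effect, i.e. this hypothetical experiment would be under-powered.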
Setup 2 (All samples). The setup, which considers all users regardless of whether they qualify for any strategy, also contains two analysis groups, $A$ and $B$, each taking half of the population:
\[
n_A = n_B = \frac{n_0 + n_1 + n_2 + n_3}{2}.
\]
The mean response and response variance for groups $A$ and $B$ are the weighted mean response and response variance of the constituent user groups respectively, weighted by the constituent groups' sizes. As we only provide treatment to those who qualify for strategy 1 in group $A$, and those who qualify for strategy 2 in group $B$, this leads to different responses in different constituent user groups:
\[
\mu_A = \frac{n_0\mu_{C0} + n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{I\cap 1}}{n_0 + n_1 + n_2 + n_3}, \tag{9}
\]
\[
\mu_B = \frac{n_0\mu_{C0} + n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_0 + n_1 + n_2 + n_3};
\]
\[
\sigma^2_A = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{I\cap 1}}{n_0 + n_1 + n_2 + n_3}, \tag{10}
\]
\[
\sigma^2_B = \frac{n_0\sigma^2_{C0} + n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_0 + n_1 + n_2 + n_3}.
\]
Substituting the above into the definitions of the actual effect size and MDE and simplifying the resultant expressions we have:
\[
\Delta_{S2} = \frac{n_1(\mu_{C1} - \mu_{I1}) + n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{I\cap 1})}{n_0 + n_1 + n_2 + n_3}, \tag{11}
\]
\[
\theta^*_{S2} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})\big)}{(n_0 + n_1 + n_2 + n_3)^2}}. \tag{12}
\]
Setup 3 (Qualified users only). The setup is very similar to Setup 2, with members of user group 0 excluded:
\[
n_A = n_B = \frac{n_1 + n_2 + n_3}{2}.
\]
The absence of members of user group 0 means they are not featured in the weighted mean response and response variance of the two analysis groups:
\[
\mu_A = \frac{n_1\mu_{I1} + n_2\mu_{C2} + n_3\mu_{I\cap 1}}{n_1 + n_2 + n_3}, \quad \mu_B = \frac{n_1\mu_{C1} + n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_1 + n_2 + n_3};
\]
\[
\sigma^2_A = \frac{n_1\sigma^2_{I1} + n_2\sigma^2_{C2} + n_3\sigma^2_{I\cap 1}}{n_1 + n_2 + n_3}, \quad \sigma^2_B = \frac{n_1\sigma^2_{C1} + n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_1 + n_2 + n_3}. \tag{13}
\]
This leads to the following actual effect size and MDE for Setup 3:
\[
\Delta_{S3} = \frac{n_1(\mu_{C1} - \mu_{I1}) + n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{I\cap 1})}{n_1 + n_2 + n_3}, \tag{14}
\]
\[
\theta^*_{S3} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{2\big(n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})\big)}{(n_1 + n_2 + n_3)^2}}. \tag{15}
\]
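Equations (11)-(15) differ only in whether user group 0 enters the weighting, so both setups can be evaluated with one helper; all group sizes, means, and variances below are hypothetical values for illustration:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

# hypothetical user-group sizes, mean responses, and response variances
n = {0: 40_000, 1: 20_000, 2: 20_000, 3: 10_000}
mu = {"C0": 0.50, "I1": 0.52, "C1": 0.50, "I2": 0.55, "C2": 0.50,
      "Icap1": 0.52, "Icap2": 0.55}
var = {k: 1.0 for k in mu}

def effect_and_mde(groups):
    """Delta and theta* for a 50/50 split over the given user groups
    (Setup 2: groups 0-3; Setup 3: groups 1-3), per Equations (11)-(15)."""
    total = sum(n[g] for g in groups)
    eta = (n[1] * (mu["C1"] - mu["I1"]) + n[2] * (mu["I2"] - mu["C2"])
           + n[3] * (mu["Icap2"] - mu["Icap1"]))
    xi = (n[1] * (var["I1"] + var["C1"]) + n[2] * (var["C2"] + var["I2"])
          + n[3] * (var["Icap1"] + var["Icap2"]))
    if 0 in groups:  # group 0 contributes n0 * 2 * sigma^2_C0, per Equation (12)
        xi += n[0] * 2 * var["C0"]
    return eta / total, z * (2 * xi / total ** 2) ** 0.5

delta_s2, theta_s2 = effect_and_mde([0, 1, 2, 3])
delta_s3, theta_s3 = effect_and_mde([1, 2, 3])
```

With these particular values, Setup 3 has the larger actual effect but also the larger MDE, illustrating the dilution trade-off discussed in Section 3.2.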
Setup 4 (Dual control). Setup 4 is unique amongst the experiment setups introduced as it has four analysis groups. Two of the analysis groups are drawn from those who qualify for strategy 1 and are allocated into the first half, and the other two are drawn from those who qualify for strategy 2 and are allocated into the second half:
\[
n_{A1} = n_{A2} = \frac{n_1 + n_3}{4}, \quad n_{B1} = n_{B2} = \frac{n_2 + n_3}{4}. \tag{16}
\]
The mean response and response variance of each analysis group are the weighted metric mean response and response variance of the user groups involved respectively:
\[
\mu_{A1} = \frac{n_1\mu_{C1} + n_3\mu_{C3}}{n_1 + n_3}, \quad \mu_{A2} = \frac{n_1\mu_{I1} + n_3\mu_{I\cap 1}}{n_1 + n_3},
\]
\[
\mu_{B1} = \frac{n_2\mu_{C2} + n_3\mu_{C3}}{n_2 + n_3}, \quad \mu_{B2} = \frac{n_2\mu_{I2} + n_3\mu_{I\cap 2}}{n_2 + n_3};
\]
\[
\sigma^2_{A1} = \frac{n_1\sigma^2_{C1} + n_3\sigma^2_{C3}}{n_1 + n_3}, \quad \sigma^2_{A2} = \frac{n_1\sigma^2_{I1} + n_3\sigma^2_{I\cap 1}}{n_1 + n_3},
\]
\[
\sigma^2_{B1} = \frac{n_2\sigma^2_{C2} + n_3\sigma^2_{C3}}{n_2 + n_3}, \quad \sigma^2_{B2} = \frac{n_2\sigma^2_{I2} + n_3\sigma^2_{I\cap 2}}{n_2 + n_3}. \tag{17}
\]
As the setup takes the difference of differences (i.e. the difference between groups $B2$ and $B1$, and the difference between groups $A2$ and $A1$), the actual effect size is specified, post-simplification, as follows:
\[
\Delta_{S4} = (\mu_{B2} - \mu_{B1}) - (\mu_{A2} - \mu_{A1}) = \frac{n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})}{n_2 + n_3} - \frac{n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})}{n_1 + n_3}. \tag{18}
\]
The MDE for Setup 4 is similar to that specified on the RHS of Inequality (5), albeit with more groups:
\[
\theta^*_{S4} = (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\sigma^2_{A1}/n_{A1} + \sigma^2_{A2}/n_{A2} + \sigma^2_{B1}/n_{B1} + \sigma^2_{B2}/n_{B2}}
\]
\[
= 2\,(z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 1})}{(n_1 + n_3)^2} + \frac{n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 2})}{(n_2 + n_3)^2}}. \tag{19}
\]
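A minimal sketch of Equations (18) and (19), again under hypothetical parameter values:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)

# hypothetical group sizes, mean responses, and response variances
n1, n2, n3 = 20_000, 20_000, 10_000
mu = {"I1": 0.52, "C1": 0.50, "I2": 0.55, "C2": 0.50, "C3": 0.50,
      "Icap1": 0.52, "Icap2": 0.55}
var = {k: 1.0 for k in mu}

# Equation (18): difference-of-differences actual effect
delta_s4 = ((n2 * (mu["I2"] - mu["C2"]) + n3 * (mu["Icap2"] - mu["C3"])) / (n2 + n3)
            - (n1 * (mu["I1"] - mu["C1"]) + n3 * (mu["Icap1"] - mu["C3"])) / (n1 + n3))

# Equation (19): MDE across the four analysis groups
theta_s4 = 2 * z * (
    (n1 * (var["C1"] + var["I1"]) + n3 * (var["C3"] + var["Icap1"])) / (n1 + n3) ** 2
    + (n2 * (var["C2"] + var["I2"]) + n3 * (var["C3"] + var["Icap2"])) / (n2 + n3) ** 2
) ** 0.5
```

Splitting each half into its own treatment/control pair inflates the MDE relative to the simpler setups, which is why the dual control setup demands so many users.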
B METRIC DILUTION
We now expand the calculations in Section 3.2 of the paper, where we discuss the conditions under which an experiment setup with metric dilution (Setup 2) will emerge superior to one without metric dilution (Setup 3), and vice versa.
B.1 The first criterion
We first show that $\theta^*_{S3} < \theta^*_{S2}$, the condition which will lead to Setup 3 being superior to Setup 2 under the first criterion, is equivalent to
\[
\big(n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})\big) \cdot \frac{n_0 + 2n_1 + 2n_2 + 2n_3}{2(n_1 + n_2 + n_3)^2} < \sigma^2_{C0}. \tag{20}
\]
We start by substituting the expressions for $\theta^*_{S2}$ (Equation (12)) and $\theta^*_{S3}$ (Equation (15)) into the inequality $\theta^*_{S3} < \theta^*_{S2}$ to obtain
\[
(z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{2\big(n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})\big)}{(n_1 + n_2 + n_3)^2}}
\]
\[
< (z_{1-\alpha/2} - z_{1-\pi_{\min}}) \sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})\big)}{(n_0 + n_1 + n_2 + n_3)^2}}. \tag{20a}
\]
Canceling the $z_{1-\alpha/2} - z_{1-\pi_{\min}}$ and $\sqrt{2}$ terms on both sides, and then squaring, yields
\[
\frac{n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}{(n_1 + n_2 + n_3)^2} < \frac{n_0(2\sigma^2_{C0}) + n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}{(n_0 + n_1 + n_2 + n_3)^2}. \tag{20b}
\]
We then write $\xi = n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})$ and move the $\xi$ term on the RHS to the LHS:
\[
\xi\left(\frac{1}{(n_1 + n_2 + n_3)^2} - \frac{1}{(n_0 + n_1 + n_2 + n_3)^2}\right) < \frac{n_0(2\sigma^2_{C0})}{(n_0 + n_1 + n_2 + n_3)^2}. \tag{20c}
\]
As the partial fractions can be consolidated as
\[
\frac{1}{(n_1 + n_2 + n_3)^2} - \frac{1}{(n_0 + n_1 + n_2 + n_3)^2} = \frac{(n_0 + n_1 + n_2 + n_3)^2 - (n_1 + n_2 + n_3)^2}{(n_1 + n_2 + n_3)^2 (n_0 + n_1 + n_2 + n_3)^2} = \frac{(n_0 + 2n_1 + 2n_2 + 2n_3)\,n_0}{(n_1 + n_2 + n_3)^2 (n_0 + n_1 + n_2 + n_3)^2}, \tag{20d}
\]
where the second step utilizes the identity $a^2 - b^2 = (a + b)(a - b)$, Inequality (20c) can be written as
\[
\xi\left(\frac{(n_0 + 2n_1 + 2n_2 + 2n_3)\,n_0}{(n_1 + n_2 + n_3)^2 (n_0 + n_1 + n_2 + n_3)^2}\right) < \frac{n_0(2\sigma^2_{C0})}{(n_0 + n_1 + n_2 + n_3)^2}. \tag{20e}
\]
We finally cancel the $n_0$ and $(n_0 + n_1 + n_2 + n_3)^2$ terms on both sides of Inequality (20e), move the factor of two to the LHS, and write $\xi$ in its full form to arrive at Inequality (20):
\[
\big(n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})\big) \cdot \frac{n_0 + 2n_1 + 2n_2 + 2n_3}{2(n_1 + n_2 + n_3)^2} < \sigma^2_{C0}.
\]
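Since Inequality (20) is an algebraic rearrangement of $\theta^*_{S3} < \theta^*_{S2}$, the two comparisons should agree for any parameter values. A randomized check (a sketch, with all parameters drawn arbitrarily):

```python
from statistics import NormalDist
import random

z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)
rng = random.Random(0)

def criteria_agree():
    n0, n1, n2, n3 = (rng.randint(1_000, 100_000) for _ in range(4))
    var_c0 = rng.uniform(0.1, 5.0)
    # xi = n1(s2_I1 + s2_C1) + n2(s2_C2 + s2_I2) + n3(s2_Icap1 + s2_Icap2)
    xi = sum(ni * (rng.uniform(0.1, 5.0) + rng.uniform(0.1, 5.0))
             for ni in (n1, n2, n3))
    sub, total = n1 + n2 + n3, n0 + n1 + n2 + n3
    theta_s3 = z * (2 * xi / sub ** 2) ** 0.5                          # Equation (15)
    theta_s2 = z * (2 * (n0 * 2 * var_c0 + xi) / total ** 2) ** 0.5    # Equation (12)
    # direct MDE comparison vs the simplified Inequality (20)
    return (theta_s3 < theta_s2) == (xi * (n0 + 2 * sub) / (2 * sub ** 2) < var_c0)

all_agree = all(criteria_agree() for _ in range(1000))
```

All 1,000 randomized draws should agree, up to the measure-zero boundary case where the two sides are exactly equal.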
B.2 The second criterion
In the case where Inequality (20) does not hold, we consider when Setup 3 will emerge superior to Setup 2 under the second criterion:
\[
\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}.
\]
If this inequality does not hold either (and both sides are not equal), we consider Setup 2 as superior to Setup 3 under the same criterion, as the following holds:
\[
\Delta_{S3} - \Delta_{S2} < \theta^*_{S3} - \theta^*_{S2} \iff \Delta_{S2} - \Delta_{S3} > \theta^*_{S2} - \theta^*_{S3}.
\]
The master inequality. We first show that the inequality $\Delta_{S3} - \Delta_{S2} > \theta^*_{S3} - \theta^*_{S2}$ is equivalent to
\[
\frac{n_1 + n_2 + n_3}{n_0}\sqrt{2 n_0 \sigma^2_{C0} + \xi} > \frac{n_0 + n_1 + n_2 + n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}, \tag{22}
\]
where
\[
\eta = n_1(\mu_{C1} - \mu_{I1}) + n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{I\cap 1}),
\]
\[
\xi = n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{I2} + \sigma^2_{C2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2}), \ \text{and} \ z = z_{1-\alpha/2} - z_{1-\pi_{\min}}.
\]
Writing $\eta$, $\xi$, and $z$ as shown above, we substitute the expressions for $\Delta_{S2}$, $\theta^*_{S2}$, $\Delta_{S3}$, and $\theta^*_{S3}$ (Equations (11), (12), (14), and (15) respectively) into the initial inequality to obtain
\[
\frac{\eta}{n_1 + n_2 + n_3} - \frac{\eta}{n_0 + n_1 + n_2 + n_3} > z\sqrt{\frac{2\xi}{(n_1 + n_2 + n_3)^2}} - z\sqrt{\frac{2\big(n_0(2\sigma^2_{C0}) + \xi\big)}{(n_0 + n_1 + n_2 + n_3)^2}}. \tag{22a}
\]
Pulling out the common factors on each side we have
\[
\eta\left(\frac{1}{n_1 + n_2 + n_3} - \frac{1}{n_0 + n_1 + n_2 + n_3}\right) > \sqrt{2}\,z\left(\frac{\sqrt{\xi}}{n_1 + n_2 + n_3} - \frac{\sqrt{2 n_0 \sigma^2_{C0} + \xi}}{n_0 + n_1 + n_2 + n_3}\right). \tag{22b}
\]
Writing the partial fraction on the LHS of Inequality (22b) as a composite fraction we have
\[
\eta\left(\frac{n_0}{(n_1 + n_2 + n_3)(n_0 + n_1 + n_2 + n_3)}\right) > \sqrt{2}\,z\left(\frac{\sqrt{\xi}}{n_1 + n_2 + n_3} - \frac{\sqrt{2 n_0 \sigma^2_{C0} + \xi}}{n_0 + n_1 + n_2 + n_3}\right). \tag{22c}
\]
We then move the composite fraction to the RHS and the $\sqrt{2}\,z$ term to the LHS:
\[
\frac{\eta}{\sqrt{2}\,z} > \frac{(n_1 + n_2 + n_3)(n_0 + n_1 + n_2 + n_3)}{n_0}\left(\frac{\sqrt{\xi}}{n_1 + n_2 + n_3} - \frac{\sqrt{2 n_0 \sigma^2_{C0} + \xi}}{n_0 + n_1 + n_2 + n_3}\right), \tag{22d}
\]
and expand the brackets, canceling terms that appear on both the top and bottom of the fractions on the RHS:
\[
\frac{\eta}{\sqrt{2}\,z} > \frac{n_0 + n_1 + n_2 + n_3}{n_0}\sqrt{\xi} - \frac{n_1 + n_2 + n_3}{n_0}\sqrt{2 n_0 \sigma^2_{C0} + \xi}. \tag{22e}
\]
Finally, we swap the position of the leftmost term with that of the rightmost term in Inequality (22e) to arrive at Inequality (22):
\[
\frac{n_1 + n_2 + n_3}{n_0}\sqrt{2 n_0 \sigma^2_{C0} + \xi} > \frac{n_0 + n_1 + n_2 + n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}.
\]
The trivial case: RHS ≤ 0. We first observe that the LHS of Inequality (22) is always positive, and hence the inequality trivially holds if the RHS is non-positive. Here we show that RHS ≤ 0 is equivalent to
\[
\frac{n_0 + n_1 + n_2 + n_3}{n_0}\,\theta^*_{S3} \le \Delta_{S3}. \tag{23}
\]
This can be done by writing RHS ≤ 0 in full:
\[
\frac{n_0 + n_1 + n_2 + n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z} \le 0, \tag{23a}
\]
and moving the second term on the LHS to the RHS:
\[
\frac{n_0 + n_1 + n_2 + n_3}{n_0}\sqrt{\xi} \le \frac{\eta}{\sqrt{2}\,z}. \tag{23b}
\]
We then multiply both sides by a factor of $\sqrt{2}\,z/(n_1 + n_2 + n_3)$ to get
\[
\frac{n_0 + n_1 + n_2 + n_3}{n_0} \cdot \frac{\sqrt{2}\,z\sqrt{\xi}}{n_1 + n_2 + n_3} \le \frac{\eta}{n_1 + n_2 + n_3}. \tag{23c}
\]
Noting from Equations (14) and (15) that
\[
\Delta_{S3} = \frac{\eta}{n_1 + n_2 + n_3} \quad \text{and} \quad \theta^*_{S3} = \frac{\sqrt{2}\,z\sqrt{\xi}}{n_1 + n_2 + n_3},
\]
we finally replace the terms in Inequality (23c) with $\Delta_{S3}$ and $\theta^*_{S3}$ to arrive at Inequality (23):
\[
\frac{n_0 + n_1 + n_2 + n_3}{n_0}\,\theta^*_{S3} \le \Delta_{S3}.
\]
The non-trivial case: RHS > 0. We then tackle the case where the RHS of the master inequality (Inequality (22)) is greater than zero. We show that in this non-trivial case, Inequality (22) is equivalent to
\[
\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1 + n_2 + n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1 + n_2 + n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}. \tag{24}
\]
We first multiply both sides of Inequality (22) by the fraction $n_0\sqrt{2}\,z/(n_1 + n_2 + n_3)$ to get
\[
\frac{n_1 + n_2 + n_3}{n_0}\sqrt{2 n_0 \sigma^2_{C0} + \xi} \cdot \frac{n_0\sqrt{2}\,z}{n_1 + n_2 + n_3} > \left(\frac{n_0 + n_1 + n_2 + n_3}{n_0}\sqrt{\xi} - \frac{\eta}{\sqrt{2}\,z}\right)\frac{n_0\sqrt{2}\,z}{n_1 + n_2 + n_3}. \tag{24a}
\]
Canceling terms on both the top and bottom of the fractions we have
\[
\sqrt{2 n_0 \sigma^2_{C0} + \xi}\,\sqrt{2}\,z > (n_0 + n_1 + n_2 + n_3)\frac{\sqrt{2}\,z\sqrt{\xi}}{n_1 + n_2 + n_3} - \frac{n_0\,\eta}{n_1 + n_2 + n_3}. \tag{24b}
\]
Again noting the identities for $\Delta_{S3}$ and $\theta^*_{S3}$ stated above, we can replace the fractions on the RHS and obtain
\[
\sqrt{2 n_0 \sigma^2_{C0} + \xi}\,\sqrt{2}\,z > (n_0 + n_1 + n_2 + n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}. \tag{24c}
\]
We then square both sides of Inequality (24c) and move the $2z^2$ term to the RHS:
\[
2 n_0 \sigma^2_{C0} + \xi > \frac{\big((n_0 + n_1 + n_2 + n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}\big)^2}{2z^2}. \tag{24d}
\]
Note the squaring still allows the implication to go both ways, as both sides of Inequality (24c) are positive. Based on the identity for $\theta^*_{S3}$, we observe $\xi$ can also be written as
\[
\xi = \frac{(n_1 + n_2 + n_3)^2\,(\theta^*_{S3})^2}{2z^2}. \tag{24e}
\]
Thus, we can group all terms with a $2z^2$ denominator by moving $\xi$ in Inequality (24d) to the RHS and substituting Equation (24e) into the resultant inequality:
\[
2 n_0 \sigma^2_{C0} > \frac{\big((n_0 + n_1 + n_2 + n_3)\,\theta^*_{S3} - n_0\,\Delta_{S3}\big)^2 - \big((n_1 + n_2 + n_3)\,\theta^*_{S3}\big)^2}{2z^2}. \tag{24f}
\]
We finally normalize the inequality to one with unit $\Delta_{S3}$ and $\theta^*_{S3}$ to enable effective comparison. We divide both sides of Inequality (24f) by $n_0^2$:
\[
\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\frac{n_0 + n_1 + n_2 + n_3}{n_0}\theta^*_{S3} - \Delta_{S3}\right)^2 - \left(\frac{n_1 + n_2 + n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}, \tag{24g}
\]
and split the coefficient of $\theta^*_{S3}$ in the first squared term into an integer ($1$) and a fractional ($(n_1 + n_2 + n_3)/n_0$) part to arrive at Inequality (24):
\[
\frac{2\sigma^2_{C0}}{n_0} > \frac{\left(\theta^*_{S3} - \Delta_{S3} + \frac{n_1 + n_2 + n_3}{n_0}\theta^*_{S3}\right)^2 - \left(\frac{n_1 + n_2 + n_3}{n_0}\theta^*_{S3}\right)^2}{2z^2}.
\]
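The case split above can be checked numerically against the second criterion evaluated directly from Equations (11)-(15); all parameters below are drawn arbitrarily for the check:

```python
from statistics import NormalDist
import random

z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)
rng = random.Random(1)

def criteria_agree():
    n0, n1, n2, n3 = (rng.randint(1_000, 100_000) for _ in range(4))
    var_c0 = rng.uniform(0.1, 5.0)
    eta = rng.uniform(-500.0, 500.0) * (n1 + n2 + n3) / 1000  # may be negative
    xi = sum(ni * (rng.uniform(0.1, 5.0) + rng.uniform(0.1, 5.0))
             for ni in (n1, n2, n3))
    sub, total = n1 + n2 + n3, n0 + n1 + n2 + n3

    delta_s3, theta_s3 = eta / sub, z * (2 * xi / sub ** 2) ** 0.5
    delta_s2 = eta / total
    theta_s2 = z * (2 * (2 * n0 * var_c0 + xi) / total ** 2) ** 0.5

    direct = delta_s3 - delta_s2 > theta_s3 - theta_s2          # the second criterion
    rhs = (total / n0) * xi ** 0.5 - eta / (2 ** 0.5 * z)       # RHS of Inequality (22)
    if rhs <= 0:                                                # trivial case, (23)
        derived = True
    else:                                                       # non-trivial case, (24)
        derived = 2 * var_c0 / n0 > (
            (theta_s3 - delta_s3 + sub / n0 * theta_s3) ** 2
            - (sub / n0 * theta_s3) ** 2
        ) / (2 * z ** 2)
    return direct == derived

all_agree = all(criteria_agree() for _ in range(1000))
```

The direct evaluation and the (23)/(24) case split should agree on every draw, apart from the measure-zero tie cases.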
C DUAL CONTROL
We finally clarify the calculations in Section 3.3 of the paper, where we determine the sample size required for Setup 4 (a dual control setup) to emerge superior to Setup 3 (a simpler A/B test setup). In the paper we showed that $\theta^*_{S4} > \theta^*_{S3}$ always holds, and hence Setup 4 will never be superior to Setup 3 under the first evaluation criterion. We thus focus on the second evaluation criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$, and first show that the criterion is equivalent to Inequality (27). Assuming the ratio of the user group sizes $n_1 : n_2 : n_3$ remains unchanged, we then show how
(i) The RHS of Inequality (27) remains a constant, and
(ii) The LHS of Inequality (27) scales as $O(\sqrt{n})$, where $n$ is the number of users.
These results mean Setup 4 could emerge superior to Setup 3 if we have a sufficiently large number of users. We also show from the inequality that
(iii) The number of users required for a dual control setup to emerge superior to simpler setups is accessible only to the largest organizations and their top affiliates.
The master inequality. We first show that the criterion $\Delta_{S4} - \Delta_{S3} > \theta^*_{S4} - \theta^*_{S3}$ is equivalent to
\[
\frac{n_1\,\frac{n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})}{n_2 + n_3} - n_2\,\frac{n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})}{n_1 + n_3}}{\sqrt{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}}
\]
\[
> \sqrt{2}\,z\left[\sqrt{\frac{2\left(\big(1 + \frac{n_2}{n_1 + n_3}\big)^2\big[n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 1})\big] + \big(1 + \frac{n_1}{n_2 + n_3}\big)^2\big[n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 2})\big]\right)}{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}} - 1\right], \tag{27}
\]
where $z = z_{1-\alpha/2} - z_{1-\pi_{\min}}$. The number of terms involved is large, and hence we first simplify the LHS and RHS independently, and combine them in the final step.
For the LHS (i.e. $\Delta_{S4} - \Delta_{S3}$), we substitute in Equations (14) and (18) to obtain
\[
\frac{n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})}{n_2 + n_3} - \frac{n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})}{n_1 + n_3} - \frac{n_1(\mu_{C1} - \mu_{I1}) + n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{I\cap 1})}{n_1 + n_2 + n_3}. \tag{27a}
\]
The expression can be rewritten in terms of multiplicative products between the $n$-terms and the (differences between) $\mu$-terms:
\[
n_1(\mu_{I1} - \mu_{C1})\left[-\frac{1}{n_1 + n_3} + \frac{1}{n_1 + n_2 + n_3}\right] + n_2(\mu_{I2} - \mu_{C2})\left[\frac{1}{n_2 + n_3} - \frac{1}{n_1 + n_2 + n_3}\right]
\]
\[
+\; n_3\,\mu_{I\cap 2}\left[\frac{1}{n_2 + n_3} - \frac{1}{n_1 + n_2 + n_3}\right] + n_3\,\mu_{I\cap 1}\left[-\frac{1}{n_1 + n_3} + \frac{1}{n_1 + n_2 + n_3}\right] + n_3\,\mu_{C3}\left[-\frac{1}{n_2 + n_3} + \frac{1}{n_1 + n_3}\right]. \tag{27b}
\]
We then extract a $1/(n_1 + n_2 + n_3)$ term from Expression (27b):
\[
(n_1 + n_2 + n_3)^{-1}\bigg[n_1(\mu_{I1} - \mu_{C1})\left(-\frac{n_1 + n_2 + n_3}{n_1 + n_3} + 1\right) + n_2(\mu_{I2} - \mu_{C2})\left(\frac{n_1 + n_2 + n_3}{n_2 + n_3} - 1\right)
\]
\[
+\; n_3\,\mu_{I\cap 2}\left(\frac{n_1 + n_2 + n_3}{n_2 + n_3} - 1\right) + n_3\,\mu_{I\cap 1}\left(-\frac{n_1 + n_2 + n_3}{n_1 + n_3} + 1\right) + n_3\,\mu_{C3}\left(-\frac{n_1 + n_2 + n_3}{n_2 + n_3} + \frac{n_1 + n_2 + n_3}{n_1 + n_3}\right)\bigg]. \tag{27c}
\]
This allows us to perform some cancellation with the RHS, which also has a $1/(n_1 + n_2 + n_3)$ term, in the final step. Noting
\[
\frac{n_1 + n_2 + n_3}{n_1 + n_3} = 1 + \frac{n_2}{n_1 + n_3} \quad \text{and} \quad \frac{n_1 + n_2 + n_3}{n_2 + n_3} = 1 + \frac{n_1}{n_2 + n_3},
\]
we can write the inner brackets as
\[
(n_1 + n_2 + n_3)^{-1}\bigg[n_1(\mu_{I1} - \mu_{C1})\left(-\frac{n_2}{n_1 + n_3}\right) + n_2(\mu_{I2} - \mu_{C2})\left(\frac{n_1}{n_2 + n_3}\right) + n_3\,\mu_{I\cap 2}\left(\frac{n_1}{n_2 + n_3}\right)
\]
\[
+\; n_3\,\mu_{I\cap 1}\left(-\frac{n_2}{n_1 + n_3}\right) + n_3\,\mu_{C3}\left(-1 - \frac{n_1}{n_2 + n_3} + 1 + \frac{n_2}{n_1 + n_3}\right)\bigg], \tag{27d}
\]
and group the $n_1/(n_2 + n_3)$ and $n_2/(n_1 + n_3)$ terms to arrive at
\[
\frac{1}{n_1 + n_2 + n_3}\left[\frac{n_1}{n_2 + n_3}\big(n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})\big) - \frac{n_2}{n_1 + n_3}\big(n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})\big)\right]. \tag{27e}
\]
For the RHS (i.e. $\theta^*_{S4} - \theta^*_{S3}$), we substitute in Equations (15) and (19) to obtain
\[
2z\sqrt{\frac{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 1})}{(n_1 + n_3)^2} + \frac{n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 2})}{(n_2 + n_3)^2}}
\]
\[
-\; \sqrt{2}\,z\sqrt{\frac{n_1(\sigma^2_{I1} + \sigma^2_{C1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}{(n_1 + n_2 + n_3)^2}}, \tag{27f}
\]
where $z = z_{1-\alpha/2} - z_{1-\pi_{\min}}$. We then extract a $\sqrt{2}\,z/(n_1 + n_2 + n_3)$ term from Expression (27f) to arrive at
\[
\frac{\sqrt{2}\,z}{n_1 + n_2 + n_3}\Bigg[\sqrt{2}\sqrt{\left(\frac{n_1 + n_2 + n_3}{n_1 + n_3}\right)^2\big[n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 1})\big] + \left(\frac{n_1 + n_2 + n_3}{n_2 + n_3}\right)^2\big[n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 2})\big]}
\]
\[
-\; \sqrt{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}\Bigg], \tag{27g}
\]
where $(n_1 + n_2 + n_3)/(n_2 + n_3)$ and $(n_1 + n_2 + n_3)/(n_1 + n_3)$ can also be written as $1 + n_1/(n_2 + n_3)$ and $1 + n_2/(n_1 + n_3)$ respectively.
We finally combine both sides of the inequality by taking Expressions (27e) and (27g):
\[
\frac{1}{n_1 + n_2 + n_3}\left[\frac{n_1}{n_2 + n_3}\big(n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})\big) - \frac{n_2}{n_1 + n_3}\big(n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})\big)\right]
\]
\[
> \frac{\sqrt{2}\,z}{n_1 + n_2 + n_3}\Bigg[\sqrt{2}\sqrt{\left(1 + \frac{n_2}{n_1 + n_3}\right)^2\big[n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 1})\big] + \left(1 + \frac{n_1}{n_2 + n_3}\right)^2\big[n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 2})\big]}
\]
\[
-\; \sqrt{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}\Bigg]. \tag{27h}
\]
Canceling the common $1/(n_1 + n_2 + n_3)$ terms on both sides, and dividing both sides by $\sqrt{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}$, leads us to Inequality (27):
\[
\frac{n_1\,\frac{n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})}{n_2 + n_3} - n_2\,\frac{n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})}{n_1 + n_3}}{\sqrt{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}}
\]
\[
> \sqrt{2}\,z\left[\sqrt{\frac{2\left(\big(1 + \frac{n_2}{n_1 + n_3}\big)^2\big[n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 1})\big] + \big(1 + \frac{n_1}{n_2 + n_3}\big)^2\big[n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{C3} + \sigma^2_{I\cap 2})\big]\right)}{n_1(\sigma^2_{C1} + \sigma^2_{I1}) + n_2(\sigma^2_{C2} + \sigma^2_{I2}) + n_3(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}} - 1\right].
\]
RHS remains a constant. We then simplify the $\sigma^2$-terms in Inequality (27) by assuming that they are similar in magnitude, i.e.
\[
\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{I\cap 2} \approx \sigma^2_S,
\]
and show the RHS of the inequality is equal to
\[
\sqrt{2}\,z\left[\sqrt{2\left(\frac{n_1 + n_2 + n_3}{n_1 + n_3} + \frac{n_1 + n_2 + n_3}{n_2 + n_3}\right)} - 1\right]. \tag{28}
\]
As long as the group size ratio $n_1 : n_2 : n_3$ remains unchanged, Expression (28) will remain a constant. It is safe to apply the simplifying assumption, as we know from the evaluation framework specification that there are three classes of parameters in the inequality: the user group sizes ($n$), the mean responses ($\mu$), and the response variances ($\sigma^2$). Among these three classes of parameters, only the user group sizes have the potential to scale in any practical setting, and thus we can effectively treat the means and variances as constants below.
We begin by substituting $\sigma^2_S$ into Inequality (27):
\[
\frac{n_1\,\frac{n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})}{n_2 + n_3} - n_2\,\frac{n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})}{n_1 + n_3}}{\sqrt{n_1(2\sigma^2_S) + n_2(2\sigma^2_S) + n_3(2\sigma^2_S)}}
\]
\[
> \sqrt{2}\,z\left[\sqrt{\frac{2\left(\big(1 + \frac{n_2}{n_1 + n_3}\big)^2\big[n_1(2\sigma^2_S) + n_3(2\sigma^2_S)\big] + \big(1 + \frac{n_1}{n_2 + n_3}\big)^2\big[n_2(2\sigma^2_S) + n_3(2\sigma^2_S)\big]\right)}{n_1(2\sigma^2_S) + n_2(2\sigma^2_S) + n_3(2\sigma^2_S)}} - 1\right]. \tag{28a}
\]
Moving the common $2\sigma^2_S$ terms out and canceling the common terms in the RHS fraction we have
\[
\frac{n_1\,\frac{n_2(\mu_{I2} - \mu_{C2}) + n_3(\mu_{I\cap 2} - \mu_{C3})}{n_2 + n_3} - n_2\,\frac{n_1(\mu_{I1} - \mu_{C1}) + n_3(\mu_{I\cap 1} - \mu_{C3})}{n_1 + n_3}}{\sqrt{2\sigma^2_S\,(n_1 + n_2 + n_3)}}
\]
\[
> \sqrt{2}\,z\left[\sqrt{2 \cdot \frac{\big(1 + \frac{n_2}{n_1 + n_3}\big)^2 (n_1 + n_3) + \big(1 + \frac{n_1}{n_2 + n_3}\big)^2 (n_2 + n_3)}{n_1 + n_2 + n_3}} - 1\right]. \tag{28b}
\]
We can already see the LHS of Inequality (28b) scales as $O(\sqrt{n})$; we will demonstrate this result in greater detail below.
Focusing on the RHS of the inequality, we express the squared terms as rational fractions and divide each term in the numerator by the denominator to obtain
\[
\sqrt{2}\,z\left[\sqrt{2\left(\left(\frac{n_1 + n_2 + n_3}{n_1 + n_3}\right)^2\frac{n_1 + n_3}{n_1 + n_2 + n_3} + \left(\frac{n_1 + n_2 + n_3}{n_2 + n_3}\right)^2\frac{n_2 + n_3}{n_1 + n_2 + n_3}\right)} - 1\right]. \tag{28c}
\]
Canceling the common $n_1 + n_2 + n_3$ terms leads to that presented in Expression (28):
\[
\sqrt{2}\,z\left[\sqrt{2\left(\frac{n_1 + n_2 + n_3}{n_1 + n_3} + \frac{n_1 + n_2 + n_3}{n_2 + n_3}\right)} - 1\right].
\]
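A quick numerical check that the direct and simplified forms of the RHS agree, and that the value is unchanged as the group sizes scale with a fixed ratio (the 2:3:5 ratio and unit variance below are arbitrary choices):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975) - NormalDist().inv_cdf(0.2)
var_s = 1.0                 # common response variance sigma^2_S
ratio = (2, 3, 5)           # fixed group size ratio n1 : n2 : n3

def rhs(scale):
    n1, n2, n3 = (scale * r for r in ratio)
    total = n1 + n2 + n3
    # RHS of Inequality (27) with all response variances equal to var_s
    bracket = ((1 + n2 / (n1 + n3)) ** 2 * (n1 + n3) * 2 * var_s
               + (1 + n1 / (n2 + n3)) ** 2 * (n2 + n3) * 2 * var_s)
    xi = 2 * var_s * total
    direct = 2 ** 0.5 * z * ((2 * bracket / xi) ** 0.5 - 1)
    # Expression (28)
    simplified = 2 ** 0.5 * z * (
        (2 * (total / (n1 + n3) + total / (n2 + n3))) ** 0.5 - 1
    )
    return direct, simplified

values = [rhs(s) for s in (1_000, 10_000, 100_000)]
```

The two forms should match at every scale, and the value itself should be identical across scales, confirming that the RHS is a constant under a fixed group size ratio.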
LHS scales as $O(\sqrt{n})$. We demonstrate the scaling relation between the LHS of Inequality (27) and the number of users in each group by simplifying the $n$-terms (but not the $\sigma^2$-terms as above), assuming $n_1 \approx n_2 \approx n_3 \approx n$, and showing the LHS of the inequality is equal to
\[
\frac{\sqrt{n}\,\big((\mu_{I2} - \mu_{C2}) - (\mu_{I1} - \mu_{C1}) + \mu_{I\cap 2} - \mu_{I\cap 1}\big)}{2\sqrt{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{I\cap 1} + \sigma^2_{I\cap 2}}}. \tag{29}
\]
While the relationship (that the LHS of the inequality scales as $O(\sqrt{n})$) is evident by inspecting the LHS of Inequality (28b) or even Inequality (27) itself, we believe the simplification allows us to show the relationship more clearly.
We begin by substituting $n$ into Inequality (27) to obtain
\[
\frac{n\,\frac{n(\mu_{I2} - \mu_{C2}) + n(\mu_{I\cap 2} - \mu_{C3})}{n + n} - n\,\frac{n(\mu_{I1} - \mu_{C1}) + n(\mu_{I\cap 1} - \mu_{C3})}{n + n}}{\sqrt{n(\sigma^2_{C1} + \sigma^2_{I1}) + n(\sigma^2_{C2} + \sigma^2_{I2}) + n(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}}
\]
\[
> \sqrt{2}\,z\left[\sqrt{\frac{2\left(\big(1 + \frac{n}{n + n}\big)^2\big[n(\sigma^2_{C1} + \sigma^2_{I1}) + n(\sigma^2_{C3} + \sigma^2_{I\cap 1})\big] + \big(1 + \frac{n}{n + n}\big)^2\big[n(\sigma^2_{C2} + \sigma^2_{I2}) + n(\sigma^2_{C3} + \sigma^2_{I\cap 2})\big]\right)}{n(\sigma^2_{C1} + \sigma^2_{I1}) + n(\sigma^2_{C2} + \sigma^2_{I2}) + n(\sigma^2_{I\cap 1} + \sigma^2_{I\cap 2})}} - 1\right]. \tag{29a}
\]
Moving the common $n$-terms out and canceling them in the fractions where appropriate leads to
\[
\frac{\sqrt{n}\,\frac{1}{2}\big[\big((\mu_{I2} - \mu_{C2}) + (\mu_{I\cap 2} - \mu_{C3})\big) - \big((\mu_{I1} - \mu_{C1}) + (\mu_{I\cap 1} - \mu_{C3})\big)\big]}{\sqrt{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{I\cap 1} + \sigma^2_{I\cap 2}}}
\]
\[
> \sqrt{2}\,z\left[\sqrt{\frac{2\left(\big(1 + \frac{1}{2}\big)^2(\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C3} + \sigma^2_{I\cap 1}) + \big(1 + \frac{1}{2}\big)^2(\sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{C3} + \sigma^2_{I\cap 2})\right)}{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{I\cap 1} + \sigma^2_{I\cap 2}}} - 1\right], \tag{29b}
\]
where the LHS is equal to Expression (29) as claimed above.
It is clear that there are no $n$-terms left on the RHS of Inequality (29b), and hence the RHS remains a constant as shown previously. Setting up the inequality to demonstrate the third result (that the number of users required for a dual control setup to emerge superior is large), we further simplify the RHS of the inequality by rearranging the terms in the square root:
\[
\frac{\sqrt{n}\,\big[\big((\mu_{I2} - \mu_{C2}) + (\mu_{I\cap 2} - \mu_{C3})\big) - \big((\mu_{I1} - \mu_{C1}) + (\mu_{I\cap 1} - \mu_{C3})\big)\big]}{2\sqrt{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{I\cap 1} + \sigma^2_{I\cap 2}}}
\]
\[
> \sqrt{2}\,z\left[\sqrt{2\left(\frac{3}{2}\right)^2\left(1 + \frac{2\sigma^2_{C3}}{\sigma^2_{C1} + \sigma^2_{I1} + \sigma^2_{C2} + \sigma^2_{I2} + \sigma^2_{I\cap 1} + \sigma^2_{I\cap 2}}\right)} - 1\right]. \tag{29c}
\]
Required number of users is large. We finally show that while Setup 4 could emerge superior to Setup 3 as the number of users increases, the number of users required is high. We do so by assuming both the $\sigma^2$- and $n$-terms are similar in magnitude, i.e. $\sigma^2_{C1} \approx \sigma^2_{I1} \approx \cdots \approx \sigma^2_{I\cap 2} \approx \sigma^2_S$ and $n_1 \approx n_2 \approx n_3 \approx n$, and show that Inequality (27) is equivalent to
\[
n > \big(2\sqrt{12}\,(\sqrt{6} - 1)\,z\big)^2\,\frac{\sigma^2_S}{\Delta^2}, \tag{30}
\]
where $\Delta = (\mu_{I2} - \mu_{C2}) - (\mu_{I1} - \mu_{C1}) + \mu_{I\cap 2} - \mu_{I\cap 1}$ is the difference in actual effect sizes between Setups 4 and 3. Note we are determining when Setup 4 is superior to Setup 3 under the second evaluation criterion (that the gain in actual effect is greater than the loss in sensitivity), and thus assume $\Delta$ is positive.
The equivalence can be shown by substituting $\sigma^2_S$ into Inequality (29c), which already assumes the $n$-terms are similar in magnitude:⁸
\[
\frac{\sqrt{n}\,\big[\big((\mu_{I2} - \mu_{C2}) + (\mu_{I\cap 2} - \mu_{C3})\big) - \big((\mu_{I1} - \mu_{C1}) + (\mu_{I\cap 1} - \mu_{C3})\big)\big]}{2\sqrt{6\sigma^2_S}} > \sqrt{2}\,z\left[\sqrt{2\left(\frac{3}{2}\right)^2\left(1 + \frac{2\sigma^2_S}{6\sigma^2_S}\right)} - 1\right]. \tag{30a}
\]
Noting that the expression within the LHS square brackets is equal to $\Delta$, we simplify the expression within the RHS square root, and move every non-$n$ term to the RHS of the inequality to obtain
\[
\sqrt{n} > \sqrt{2}\,z\,\big(\sqrt{6} - 1\big)\,\frac{2\sqrt{6\sigma^2_S}}{\Delta}. \tag{30b}
\]
As all quantities in the inequality are positive, we can square both sides and consolidate the coefficients on the RHS to arrive at Inequality (30):
\[
n > \big(2\sqrt{12}\,(\sqrt{6} - 1)\,z\big)^2\,\frac{\sigma^2_S}{\Delta^2}.
\]
⁸ Alternatively we can substitute $n$ into Inequality (28b), which already assumes the $\sigma^2$-terms are similar in magnitude. Simplifying the resultant inequality would yield the same end result.