Profit-Maximizing A/B Tests

Marketers often use A/B testing as a tactical tool to compare marketing treatments in a test stage and then deploy the better-performing treatment to the remainder of the consumer population. While these tests have traditionally been analyzed using hypothesis testing, we re-frame such tactical tests as an explicit trade-off between the opportunity cost of the test (where some customers receive a sub-optimal treatment) and the potential losses associated with deploying a sub-optimal treatment to the remainder of the population. We derive a closed-form expression for the profit-maximizing test size and show that it is substantially smaller than that typically recommended for a hypothesis test, particularly when the response is noisy or when the total population is small. The common practice of using small holdout groups can be rationalized by asymmetric priors. The proposed test design achieves nearly the same expected regret as the flexible, yet harder-to-implement multi-armed bandit. We demonstrate the benefits of the method in three different marketing contexts -- website design, display advertising and catalog tests -- in which we estimate priors from past data. In all three cases, the optimal sample sizes are substantially smaller than for a traditional hypothesis test, resulting in higher profit.


Introduction
Experimentation is an important tool for marketers in a wide range of settings including direct mail, email, display advertising, social media marketing, website optimization, and app design. In tactical marketing settings, which we call "test & roll" experiments, data on customer response is first collected in a test stage where a subset of customers are randomly assigned to a treatment.
In the roll stage that follows, marketers deploy one treatment to all remaining customers based on the test results. Figure 1 shows an example test & roll setup screen. Emails with two different subject lines will each be sent to 8,910 customers at random from a total list of 59,404 email addresses. Once the test outcomes are measured, the platform sends the better-performing email to the remainder of the list.
where $\bar{y}_1$ and $\bar{y}_2$ are the mean response for each test group, $s_1$ and $s_2$ are the standard deviations of the response, $n_1$ and $n_2$ are the sample sizes, and the significance level $\alpha$ is the desired type I error that determines the critical value $z$. When using hypothesis testing, the sample size is fixed prior to data collection and $n_1$ and $n_2$ are set to detect an effect of at least $d$ with probability $1 - \beta$. When $s_1 = s_2 = s$, the recommended sample size is:
$$n_{HT} = n_1 = n_2 \approx (z_{1-\alpha/2} + z_\beta)^2 \, \frac{2s^2}{d^2} \qquad (2)$$
The recommendation is to set $n_1 = n_2$ because this maximizes the statistical power of the experiment when $s_1 = s_2$.
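As an illustration of (2), the following Python sketch computes the per-group sample size. The function name and the reading of $z_\beta$ as the standard normal quantile at the desired power are assumptions made for this example; the inputs are the website-test values used later in the paper ($\alpha = 0.05$, power 0.8, $s^2 = 0.68 \times 0.32$, $d = 0.0136$).

```python
from statistics import NormalDist


def hypothesis_test_sample_size(s, d, alpha=0.05, power=0.80):
    """Per-group sample size from (2) for a two-sided test with equal
    response standard deviation s and minimum detectable effect d."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power (assumed reading)
    return (z_alpha + z_beta) ** 2 * 2 * s ** 2 / d ** 2


mu = 0.68
s = (mu * (1 - mu)) ** 0.5   # binomial standard deviation at the prior mean
d = mu * 0.02                # a 2% lift
n_ht = hypothesis_test_sample_size(s, d)
print(round(n_ht))
```

With these inputs the result agrees with the 18,468 per group reported for the website application.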
We develop an alternative approach to planning and analyzing A/B tests with finite populations.
While null hypothesis testing is the "gold standard" in scientific and medical research and is often recommended for marketing tests (e.g., Pekelis et al. 2015), the statistical significance threshold in (1) is a poor decision rule for test & roll experiments aimed at maximizing profits, for four reasons.
First, hypothesis tests at typical significance levels (e.g., α = 0.05) are designed to avoid concluding that two treatments perform differently when they do not. Yet these Type I errors have little consequence for profit, assuming no deployment costs. If the null is true and both treatments yield identical effects, the same profit will be earned regardless of which treatment is deployed. Because of the profit trade-off between test-stage learning and roll-stage earning, conservative sample sizes based on null hypothesis testing lower overall expected profit, by exposing too many people to the less effective treatment in the test.
Second, the population available for testing and deploying is often limited, but the recommended sample size in (2) does not take this constraint into account. In online advertising experiments where effects are often small (but profitable), the recommended sample size may be larger than the size of the population itself (Lewis and Rao 2015), yet, as we show, when the population is limited, smaller tests that will never reach statistical significance can still have substantial benefit in improving expected profit.
Third, the typical null hypothesis test in (1) provides no guidance on which treatment to deploy when the results are not significant. Many A/B testers advocate deploying the incumbent treatment (if there is one) in the interest of being "conservative", choosing randomly (Gershoff 2017), or continuing the test until it reaches significance, potentially leading to p-hacking (Berman et al. 2018).
Fourth, practitioners often design tests with unequal sample sizes for each treatment (e.g., Lewis and Rao 2015, Zantedeschi et al. 2016). Our framework allows unequal sample sizes to arise naturally from prior beliefs, whereas the only rationale for this under classical hypothesis testing is different response variances ($s_1^2 \neq s_2^2$).
We re-frame the test & roll decision problem in Section 2, focusing on profit and making an explicit trade-off between the opportunity cost of the test (where some customers receive the suboptimal treatment) and the losses associated with deploying the sub-optimal treatment to the remainder of the population.
We derive a new closed-form expression for the profit-maximizing sample size in Section 3, assuming that the average revenue per customer is normally-distributed with normal priors. Test sample sizes under this framework are often substantially smaller than those recommended by (2).
Unlike sample sizes for a hypothesis test that increase linearly with the variance of the response in (2), profit-maximizing sample sizes increase sub-linearly with the standard deviation of the response, leading to substantially smaller test sizes when the response is noisy. Profit-maximizing samples are also proportional to the square root of the total size of the population available, and so they naturally scale to both large and small settings.
Improved performance is achieved because profit-maximizing tests identify the best performing treatment with high probability when treatment effects are large; the lost profit (regret) from errors in treatment selection is small when treatment effects are small. We also show that a test & roll that uses the profit-maximizing sample size achieves nearly the same level of regret as the more flexible multi-armed bandit.
Section 4 extends the analysis to situations with different priors on treatments, and provides an efficient numeric approach to computing optimal sample sizes. This allows us to rationalize the common practice of using unequally-sized treatment groups when the two treatments are believed a priori to produce different responses, e.g., a test comparing media exposure to no exposure or a test comparing two different prices.
In the test stage, a random sample of $n_1$ customers is exposed to treatment 1 and a random non-overlapping sample of $n_2$ customers is exposed to treatment 2, with $n_1 + n_2 < N$. In the roll stage, all remaining $N - n_1 - n_2$ customers receive either treatment 1 or treatment 2 based on a decision rule that incorporates the data observed in the test stage. The marketer's goal is to maximize the cumulative profit earned in both stages.
Assuming the profit for each customer receiving treatment $j$ is an independent random variable $Y_j$ that follows a distribution with parameters $\theta_j$, the expected profit earned during the test phase is:
$Y_j$ is the profit net of any costs related to the treatments, e.g., media costs or discounts. In website and email tests, for example, the cost of both treatments is the same and can be ignored.
Denote the vector of observed profit from customers exposed to treatment $j$ in the test as $y_j = (y_{j,1}, \ldots, y_{j,n_j})$. Once $y_1$ and $y_2$ are observed, the analyst chooses a treatment to deploy with the remaining $N - n_1 - n_2$ customers. Let $\delta(y_1, y_2)$ be the decision rule which takes the value 1 for the decision to deploy treatment 1 and 0 for treatment 2. The optimal decision rule is to select the treatment with the highest posterior predictive mean $E[Y_j \mid y_j]$ (DeGroot 1970).
Depending on the decision rule, the expected profit in the roll stage is:
Increasing $n_1$ and $n_2$ provides more observations about the profitability of each treatment, and thus has the potential to yield more correct decisions in the roll stage. Simultaneously, increasing $n_1$ and $n_2$ decreases the population remaining in the roll stage and increases the test population, some of which is exposed to the lesser-performing treatment.
The parameters $\theta_1$ and $\theta_2$ are unknown prior to the test (hence the need for the test). By assuming a prior distribution over these parameters, we obtain the a priori expected profit of the A/B test:
Designing the test entails selecting the sample sizes $(n_1^*, n_2^*)$ that maximize the total expected profit.
Both our approach and the hypothesis testing approach described in equations (1) and (2) are decision-theoretic but differ in three aspects: (1) We define the decision as whether to deploy treatment 1 or treatment 2, instead of deciding whether to reject the null hypothesis; (2) The objective in hypothesis testing is to maximize statistical power while controlling type I error, while we focus on maximizing profits; (3) Hypothesis testing uses a 0/1 loss function, and so every incorrect decision has the same cost, while our approach uses the actual opportunity cost as the loss, including the cost of the test.
Like all experimental designs, $(n_1^*, n_2^*)$ are sensitive to the specification of the model for $Y_j$ and the priors on $\theta_j$. These priors can be set based on previous experience with similar marketing treatments, as we illustrate in Section 5.

Test & Roll with Symmetric Normal Priors
To derive a profit-maximizing sample size formula, we assume $Y_1 \sim N(m_1, s^2)$ and $Y_2 \sim N(m_2, s^2)$ with identical priors $m_1, m_2 \sim N(\mu, \sigma^2)$.⁴ The variance of the response, $s^2$, is known; in practice it can be estimated from previously observed responses.⁵ The hyper-parameters $\mu$ and $\sigma$ represent expectations for how the two treatments may perform, which can be informed by previous similar marketing campaigns (as illustrated in Section 5).
The symmetric priors imply that neither treatment is a priori likely to perform better, but they do not imply that m 1 = m 2 . The implied prior on the treatment effect m 1 − m 2 is N (0, 2σ 2 ) and the expected absolute difference between treatments |m 1 − m 2 | is distributed half-normal with mean √ 2 √ π σ. Thus σ is related to the a priori expectation about the potential difference in treatment effects (as well as the uncertainty).
The expected profit in the test stage for this model is:
The expected profit in the roll stage depends on the decision rule $\delta(y_1, y_2)$. The profit-maximizing decision rule (8) is to choose the treatment with the greater expected posterior mean response, where $\bar{y}_j$ is the average response of treatment $j$ and $I(\cdot)$ is the indicator function. Since the priors are symmetric, this reduces to $\delta(y_1, y_2) = I(\bar{y}_1 > \bar{y}_2)$ if $n_1 = n_2$, i.e., the highly intuitive "pick the winner" in the test.
4 The normal model leads to closed form expressions for profit and sample size and is a good approximation for binomial response (e.g., clicks, purchase incidence) when test sizes are large. For binomial responses with small test sizes ($n_1$, $n_2$), Appendix B develops a beta-binomial version where sample size is computed numerically.
5 The assumption that $s_1$ and $s_2$ are known could easily be relaxed by putting priors on them, but this is not necessary for deriving key insights.
Proposition A.1 shows that the decision rule in (8) yields an expected roll-stage profit of: The second addend in the square brackets is the expected incremental profit per customer earned by (usually) deploying the better treatment relative to choosing randomly with expected profit of µ.
This incremental per customer gain from the test is increasing in the test size n 1 and n 2 . However, as (n 1 + n 2 ) increases, the number of customers for whom this higher profit is earned is smaller.
The incremental gain increases with σ which is related to the expected effect size and decreases with the measurement noise s.
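The behavior of the pick-the-winner rule can also be checked by simulation. The sketch below is an illustration under stated assumptions, not the derivation in Proposition A.1: it draws $m_1$ and $m_2$ from the symmetric prior, draws the two test-stage sample means, deploys the apparent winner, and averages total profit. The function name, the number of replications, and the seed are hypothetical choices; the inputs are the website values used in Section 5.

```python
import random


def simulate_test_and_roll(N, n, mu, sigma, s, reps=20_000, seed=1):
    """Monte Carlo estimate of expected total profit for a symmetric
    test & roll with n customers per arm and a pick-the-winner rule."""
    rng = random.Random(seed)
    se = s / n ** 0.5                  # sd of each arm's test-stage sample mean
    total = 0.0
    for _ in range(reps):
        m1 = rng.gauss(mu, sigma)      # true mean responses drawn from the prior
        m2 = rng.gauss(mu, sigma)
        ybar1 = rng.gauss(m1, se)      # observed test-stage means
        ybar2 = rng.gauss(m2, se)
        deployed = m1 if ybar1 > ybar2 else m2
        # Expected profit given the true means: test stage plus roll stage.
        total += n * (m1 + m2) + (N - 2 * n) * deployed
    return total / reps


est = simulate_test_and_roll(N=100_000, n=2_284, mu=0.68, sigma=0.03,
                             s=(0.68 * 0.32) ** 0.5)
print(round(est))
```

Up to Monte Carlo error, the estimate should land near the 69,536 expected conversions reported for the website application.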
To find the optimal sample size, the sum of the test profit in (7) and the deployment profit in (9) can be maximized over $n_1$ and $n_2$, resulting in optimal sample sizes (Proposition A.3):
The profit-maximizing sample size is always less than the population size $N$ and grows sub-linearly with the standard deviation of the response $s$. By contrast, the recommended sample size for a hypothesis test in (2) grows linearly with the variance $s^2$ without regard to $N$. This explains why, for noisy responses, hypothesis tests frequently require sample sizes that are larger than the available population (Lewis and Rao 2015).
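A short Python sketch can make the scaling concrete. The closed form used below is an assumption on our part: $n^* = \sqrt{\frac{N}{4}\left(\frac{s}{\sigma}\right)^2 + \left(\frac{3}{4}\left(\frac{s}{\sigma}\right)^2\right)^2} - \frac{3}{4}\left(\frac{s}{\sigma}\right)^2$. It is consistent with the properties stated above and with the bound $n^* < \sqrt{N}\,s/(2\sigma)$ used in Appendix A, and it reproduces the website test size reported in Section 5. The function name is hypothetical.

```python
def profit_max_sample_size(N, s, sigma):
    """Per-arm test size under symmetric normal priors (assumed closed form)."""
    ratio = (s / sigma) ** 2
    return (N / 4 * ratio + (0.75 * ratio) ** 2) ** 0.5 - 0.75 * ratio


s = (0.68 * 0.32) ** 0.5          # website response noise
n_star = profit_max_sample_size(N=100_000, s=s, sigma=0.03)
print(round(n_star))

# Doubling the noise increases n* by less than a factor of two,
# and a more diffuse prior (larger sigma) shrinks the test.
n_noisy = profit_max_sample_size(N=100_000, s=2 * s, sigma=0.03)
n_diffuse = profit_max_sample_size(N=100_000, s=s, sigma=0.06)
print(round(n_noisy), round(n_diffuse))
```

The first value matches the 2,284 per group reported for the website application.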
Notably, the profit-maximizing sample size decreases with σ. Large σ implies: (1) a larger expected difference between treatments and, (2) a lower error rate for a given sample size (see (12) below), while (3) the opportunity cost remains the same.

Error Rate
Test & roll does not require the planner to specify an acceptable level of error; the error rate follows from optimally trading off the opportunity cost of the test against the expected loss in profit due to deployment errors. However, practitioners may want to know the expected error rate. Conditional on m 1 and m 2 , the likelihood of deploying treatment 1 when treatment 2 has a better mean response is: From (11), we see that when the difference in treatments m 2 − m 1 is positive and large, the error rate is lower, i.e., the better treatment will be deployed. When m 2 − m 1 is smaller, it is more likely that the wrong treatment will be deployed, but this is less consequential for profit.
Integrating (11) over the priors on m 1 and m 2 , the expected error rate is (Corollary A.2): As expected, the error rate decreases with the test sizes n 1 and n 2 , increases with s, and decreases with σ.
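For readers who want a number, the expected error rate can be computed directly. The sketch below does not reproduce (12) from the text; it assumes the standard result that two zero-mean jointly normal variables with correlation $\rho$ disagree in sign with probability $\frac{1}{2} - \frac{1}{\pi}\arcsin\rho$, applied to the true difference $m_1 - m_2$ and the observed difference $\bar{y}_1 - \bar{y}_2$ under the symmetric priors. It counts errors in both directions, and the function name is hypothetical.

```python
from math import atan, pi, sqrt


def expected_error_rate(n1, n2, s, sigma):
    """A priori probability that pick-the-winner deploys the worse treatment."""
    # Ratio of prior spread in the true difference to sampling noise in the
    # observed difference; equals rho / sqrt(1 - rho^2).
    signal_to_noise = sqrt(2) * sigma / s * sqrt(n1 * n2 / (n1 + n2))
    return 0.5 - atan(signal_to_noise) / pi


s = sqrt(0.68 * 0.32)
err = expected_error_rate(2_284, 2_284, s, sigma=0.03)
print(f"{err:.3f}")
```

With the website inputs this gives roughly 10%, in line with the 10.0% reported in Section 5, and the function reproduces the stated comparative statics: the error rate falls with $n_1$, $n_2$ and $\sigma$ and rises with $s$.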

Regret
To provide an upper bound on the total expected profit, we compute the expected profit with perfect information (PI). If an omniscient marketer were able to deploy the treatment with higher expected profit to all $N$ customers without testing, the expected profit would be (Proposition A.4, part 1):
$$E[\Pi \mid PI] = N\left(\mu + \frac{\sigma}{\sqrt{\pi}}\right) \qquad (13)$$
The profit of any algorithm for choosing which treatment to deploy to each customer will be between the expected value of choosing randomly, which is $\mu N$, and the expected value of perfect information in (13). The expected profit with perfect information scales with the standard deviation of the prior $\sigma$; the more potential difference there is between treatments, the more opportunity there is to improve profits by choosing the better treatment.
The expected regret of the profit-maximizing test & roll experiment is (Proposition A.4, part 2): When populations are larger, the regret per customer decreases, hence marketers with larger populations have a greater opportunity to improve profits on a per-customer basis with a profit-maximizing test.
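The regret comparison can be sketched numerically. The code below assumes two closed forms that follow from the symmetric normal-normal model: perfect-information profit $N(\mu + \sigma/\sqrt{\pi})$, as in the proof of Proposition A.4, and a per-customer roll-stage gain of $\sigma^2 / \left(\sqrt{\pi}\sqrt{\sigma^2 + s^2/n}\right)$ when $n_1 = n_2 = n$. Function names are hypothetical, and the inputs are the website values from Section 5.

```python
from math import pi, sqrt


def expected_profit(N, n, mu, sigma, s):
    """Expected test-stage plus roll-stage profit with n customers per arm."""
    test = 2 * n * mu
    gain = sigma ** 2 / (sqrt(pi) * sqrt(sigma ** 2 + s ** 2 / n))
    return test + (N - 2 * n) * (mu + gain)


def perfect_information_profit(N, mu, sigma):
    """Expected profit if the better treatment were known in advance."""
    return N * (mu + sigma / sqrt(pi))


N, mu, sigma = 100_000, 0.68, 0.03
s = sqrt(mu * (1 - mu))

profit_tr = expected_profit(N, 2_284, mu, sigma, s)    # profit-maximizing size
profit_ht = expected_profit(N, 18_468, mu, sigma, s)   # hypothesis-test size
profit_pi = perfect_information_profit(N, mu, sigma)

relative_regret = (profit_pi - profit_tr) / profit_pi
share_of_gain = (profit_tr - mu * N) / (profit_pi - mu * N)
print(round(profit_tr), f"{relative_regret:.2%}", f"{share_of_gain:.1%}")
print(round(profit_tr - profit_ht))
```

Under these assumptions the sketch recovers the figures reported for the website application: about 69,536 expected conversions, regret of about 0.22%, and a loss of roughly 476 conversions when the larger hypothesis-test sample is used.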
Using the sub-optimal sample size recommended for a hypothesis test produces substantially greater regret. Assuming that the better performing treatment will be deployed after the test regardless of significance, we can substitute the value of $n_{HT}$ from (2) for $n^*$ in (14). The regret from using the larger sample size is (Proposition A.4, part 3):
implying that hypothesis testing has a lower bound expected regret of $\Omega(N)$, substantially larger than the profit-maximizing sample size with regret $O(\sqrt{N})$ as $N$ becomes large.
We can also compare a test & roll with profit-maximizing sample size to a multi-armed bandit where allocation to treatments is determined probabilistically for each customer based on previous customers' response. Agrawal and Goyal (2013) show that the expected regret of a multi-armed bandit with Thompson sampling (Thompson 1933)

Test & Roll with Asymmetric Normal Priors
The analysis thus far focused on cases with a common prior for both treatments. However, there are many situations where the priors might be different, e.g., comparing a marketing communication against a holdout control.
Under these priors, the a priori expected profit in the test stage is:
Decision rule (8) is still optimal in this case, but it no longer implies selecting the treatment that performs better in the test; the prior information now also affects the decision. Using the decision rule in (8), the a priori expected profit in the roll stage is (Proposition A.1):
The expected total profit can be maximized over $n_1$ and $n_2$ to find the optimal sample size. The optimal sample sizes cannot be solved for analytically, but the function can be easily optimized numerically.
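One way to carry out the numeric optimization is a coarse-to-fine grid search, sketched below. This is not the procedure from Appendix C. It assumes the roll-stage profit takes the form $(N - n_1 - n_2)\left[\mu_2 + e\,\Phi(e/\sqrt{v}) + \sqrt{v}\,\phi(e/\sqrt{v})\right]$, the expected maximum of the two posterior means, with $e$ and $v$ as defined in Appendix A. Function names, grid settings, and the incumbent/challenger prior values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_total_profit(n1, n2, N, mu, sigma, s):
    """A priori expected profit with arm-specific priors N(mu[j], sigma[j]^2)
    and response noise s[j], deploying the arm with the higher posterior mean."""
    test = n1 * mu[0] + n2 * mu[1]
    # Prior-predictive variance of each arm's posterior mean.
    v1 = sigma[0] ** 4 / (sigma[0] ** 2 + s[0] ** 2 / n1)
    v2 = sigma[1] ** 4 / (sigma[1] ** 2 + s[1] ** 2 / n2)
    e, v = mu[0] - mu[1], v1 + v2
    z = e / sqrt(v)
    best = mu[1] + e * STD_NORMAL.cdf(z) + sqrt(v) * STD_NORMAL.pdf(z)
    return test + (N - n1 - n2) * best


def optimal_sizes(N, mu, sigma, s, n_max=5_000, coarse=50):
    """Coarse-to-fine integer grid search for (n1, n2)."""
    def best_on(grid1, grid2):
        return max(((a, b) for a in grid1 for b in grid2 if a + b < N),
                   key=lambda ab: expected_total_profit(ab[0], ab[1], N, mu, sigma, s))

    c1, c2 = best_on(range(coarse, n_max + 1, coarse), range(coarse, n_max + 1, coarse))
    fine1 = range(max(1, c1 - coarse), c1 + coarse + 1)
    fine2 = range(max(1, c2 - coarse), c2 + coarse + 1)
    return best_on(fine1, fine2)


s_web = sqrt(0.68 * 0.32)
# Symmetric priors: should agree with the closed-form test size.
sym = optimal_sizes(100_000, mu=(0.68, 0.68), sigma=(0.03, 0.03), s=(s_web, s_web))
# Hypothetical incumbent (arm 1, tighter prior) versus challenger (arm 2).
asym = optimal_sizes(100_000, mu=(0.68, 0.68), sigma=(0.02, 0.04), s=(s_web, s_web))
print(sym, asym)
```

With symmetric priors the search returns test sizes that agree with the closed-form value of about 2,284 per arm, and with the hypothetical incumbent/challenger priors it allocates the larger sample to the arm with the more diffuse prior. A grid search is used only because it needs no external libraries; any standard numerical optimizer would serve.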
One example of an asymmetric test & roll experiment arises when the experimenter has more past experience with treatment 1 vs. treatment 2, implying that σ 1 < σ 2 . We dub this an "incumbent/challenger" test. For example, an incumbent can be an ad copy or page design that follows the traditional firm branding strategy, while a challenger uses a new creative approach.
When σ 1 < σ 2 , the optimal sample size will be larger for the challenger treatment, to gain more information about the challenger in the test. Specific sample size formulae are provided in Appendix C.
A second common case for asymmetric test plans is pricing experiments. Because companies face uncertainty about product demand, they often experiment with multiple prices. Different prices, however, influence two important factors. The first is the number of people who will purchase the product; higher prices will elicit fewer purchases. The second is the profit per person; higher prices yield higher profits conditional on purchase. Thus, setting different prices effectively changes the priors on the mean profit per customer, which implies different optimal sample sizes. Appendix C describes an example setup.

Applications
Designing a profit-maximizing test & roll requires priors on the distribution of the mean response rate of the treatments. This section illustrates how to estimate these priors using data on past marketing interventions, similar to using a pre-test to inform priors for conjoint design (Arora and Huber 2001). We then use the estimated priors to provide optimal test plans for three different marketing contexts and compare them to hypothesis testing and multi-armed bandits, based on expected profit and regret. The first two applications use symmetric priors, while the third presents a situation where asymmetric priors are appropriate.

Website testing
To set priors based on past data, we analyze 2,101 website tests from Berman et al. (2018) which were conducted across a wide variety of websites. For each treatment arm in each experiment we observe the click rate $\bar{y}$ and sample size $n$.⁹ Fitting a hierarchical model to this data to account for the fact that $\bar{y}$ is a noisy estimator, we estimate that the mean responses (click rates) are distributed $N(0.68, 0.03)$ across treatment arms. (Appendix E.1 details the data and estimation.) To plan a new test, we assume this as a symmetric prior on mean response ($m_1$ and $m_2$).
Assuming symmetric priors is reasonable as there is typically no prior information that one version of a web page will perform better than the other. The assumed prior implies an expected absolute difference between treatments of E[|m 1 − m 2 |] = 0.023.
We compute the sample size based on (10), using $\sqrt{\mu(1 - \mu)}$ to approximate $s$.¹⁰ The population size $N$ is set based on the expected number of people who will visit the website over the deployment period. As an example, with $N$ = 100,000, the optimal test size is $n_1^* = n_2^*$ = 2,284 in each test group. The expected number of clicks is 3,106 in the test and 66,430 more when the better-performing treatment is deployed, for a total of 69,536 conversions. Following (12), this test will deploy the worse-performing web page 10.0% of the time, and this represents the optimal trade-off with the opportunity cost of the test. The profit-maximizing test & roll has expected regret of 0.22% relative to expected profit with perfect information and achieves 90.7% of the potential gains over choosing randomly. Test sizes larger than optimal decrease the error rate marginally (Figure 2b), but erode overall expected profit (Figure 2a).
9 While it would be ideal to observe sales and revenue for each visitor, this is not always possible. As a proxy, we assume for this example that profit is proportional to the number of clicks.
10 The normal approximation works well for binary tests that have moderate response rates and large sample sizes. For smaller test sizes or more extreme response rates, a beta-binomial model produces a more accurate distribution of the mean response rate across arms. See Appendix B.
Comparing the test & roll sample size to that recommended for a hypothesis test in (2) requires selecting acceptable levels of type I and type II errors (α, β) and minimum detectable effect (d).
Following typical A/B testing guidelines, we assume: $\alpha = 0.05$, $\beta = 0.8$ and $d = 0.68 \times 0.02 = 0.0136$, i.e., a 2% lift. The resulting recommended sample size for a hypothesis test is 18,468 in each group, almost an order of magnitude larger than the profit-maximizing test size. The sample size for a hypothesis test is set to control type I and type II error tightly irrespective of the opportunity cost of the test, resulting in much larger sample sizes than are necessary to maximize expected profit. In this application, the oversized test reduces the remaining population that can receive the better treatment and results in 476 fewer expected conversions (see Figure 2).
linearly with the response noise $s$, unlike the recommended sample size for a null hypothesis test which increases with $s^2$. Panel (c) shows that when $\sigma$ is larger, smaller test sizes are sufficient to detect treatments that on average perform substantially better.¹³
To compare test & roll to a multi-armed bandit, Figure 4 shows the distribution of regret for 1,000 simulations using either a test & roll with a profit-maximizing sample size or a multi-armed bandit with Thompson sampling (see Appendix D for details). Both methods use a decision rule based on the same posterior, but the multi-armed bandit has more flexibility to recover from early observations that favor the wrong treatment. The more flexible multi-armed bandit setup achieves a tighter regret distribution and lower average regret than a profit-maximizing test & roll.
However, the difference is small: Thompson sampling achieves average regret of 0.09%, while test & roll achieves 0.22% average regret. Profit-maximizing test & roll becomes an attractive option,
Profit, regret and error rates are summarized in Table 1.
13 The value of $n_{HT}$ shown in Panel (c) assumes $d$ is set at the 25%-tile of the prior of the absolute treatment effect.

Display advertising testing
As a second example of a profit-maximizing test & roll, we base priors on online display experiments reported by Lewis and Rao (2015). We focus on 5 experiments reported for "Advertiser 1". Lewis and Rao (2015) report the mean and standard deviation of the sales response ($) in the control Ideally, we would estimate a similar distribution for the treated group, creating asymmetric priors, but Lewis and Rao (2015) do not report the treatment effects for these experiments. Instead, we assume the profit per customer m 2 has the same prior distribution as m 1 . That is, on average the ads produce a lift that precisely covers the cost.
Assuming a total population size of $N$ = 1,000,000, the profit-maximizing sample size is $n_1 = n_2$ = 11,391. Even with this small test, the decision of whether or not to advertise to the remainder of the population is incorrect only 6.9% of the time. By contrast, these tests would require a sample size of 4,782,433 in each group for a hypothesis test to detect a difference of $d = 0.19$ at $\alpha = 0.05$ and $\beta = 0.80$.¹⁴ Test sizes, profit and error rate are summarized in Table 2. As Lewis and Rao (2015) point out, tests of this size are infeasible within the budget of most advertisers and the population available on most ad platforms. However, a risk-neutral firm can reliably determine whether advertising is more profitable than not and maximize expected profits with far smaller tests. As can be seen by comparing (2) and (10), the difference in sample size is larger when $s$ is large, as it is for the display advertising tests. Even if we cut the prior variance $\sigma$ in half and increase the population to $N$ = 10,000,000, the profit-maximizing sample size only increases to $n_1 = n_2$ = 234,361, still far lower than that required for a hypothesis test.
14 The difference of 0.19 is approximately the difference between ROI= -100% and 0% assuming the ads cost 0.094 per user (the average reported cost across experiments) and the margin on retail sales is 0.5. This sample size is similar to those calculated by Lewis and Rao (2015) in Table III.

Catalog holdout testing
Finally, we illustrate how asymmetric priors described in Section 4 lead to unequal test group sizes. We estimate priors based on 30 catalog holdout tests conducted by a specialty retailer. For each customer in each test, we observe all-channel sales ($) in the month after the catalog is sent.
Appendix E details how the data is used to estimate the distribution of mean catalog responses for the treated and holdout groups. Figure 5 shows After accounting for the cost of the media (approx. $0.80), about 76.8% of catalog campaigns are expected to increase profit based on the priors in Figure 5. A test & roll experiment can be used with future campaigns to prevent mailing to the entire list when it is unprofitable. Assuming a population size of N = 100, 000, the profit-maximizing sample sizes are n * 1 = 588 (control) and n * 2 = 1, 884 (treated). An experiment with these sample sizes achieves expected total sales of $3,463,000. The recommended sample size for a hypothesis test to detect a 25% sales lift is 7,822  Table 3.
The profit-maximizing test and the null hypothesis test both allocate a larger sample to the treatment group, but for different reasons. The hypothesis test does so because the treatment group has a noisier response (s 1 < s 2 ). The profit-maximizing test additionally considers that we a priori expect greater profits from customers who receive the catalog (m 1 < m 2 ). Even if we fix s 1 = s 2 and re-estimate the hierarchical model (see Appendix E), the resulting test & roll sample size is n 1 = 771 and n 2 = 1, 949, due to the remaining differences in the priors. Figure 6 shows the sensitivity of the sample sizes to the expected catalog lift. We analyzed this sensitivity by varying µ 2 , leaving all other parameters of the priors fixed. As the plot shows, when the expected lift is very high, a small holdout group is optimal. Thus, the common practice of using small holdout tests can be rationalized by a prior expectation that the treatment increases sales (or other desired behavior) more than the cost of marketing. The test & roll framework provides a principled way to set the size of the holdout group by making these priors explicit.

Discussion
We present a new approach to planning sample sizes for A/B tests. Unlike the classic hypothesis test that emphasizes high confidence and power, our approach optimally balances the trade-off between not deploying the best treatment in the roll stage and the cost of identifying this treatment in the test stage. The practical result is far smaller recommended test sizes that scale to the size of the available population. Most importantly, by focusing on profit, we show that marketers should
15 When $\sigma_1 \neq \sigma_2$, the sample sizes $n_1 = (z_{1-\alpha/2} + z_\beta)^2 \frac{s_1^2 + s_1 s_2}{\delta^2}$ and $n_2 = (z_{1-\alpha/2} + z_\beta)^2 \frac{s_1 s_2 + s_2^2}{\delta^2}$ minimize $n_1 + n_2$ while achieving the desired confidence and power. See Luh and Guo (2007).
One limitation of our method is that the best treatment will not always be selected. Although the error rate may be higher than the one guaranteed by typical null hypothesis testing, the profit-maximizing test size sets the error rate optimally, based on the potential differences between treatments and resulting opportunity costs. In contexts where the decision maker is risk averse or the cost of deploying a subpar treatment is very high, as in clinical trials (Berry et al. 1994, Cheng et al. 2003), other approaches are warranted.
Further extensions of the test & roll framework presented in Section 2 would be useful. Other prior distributions could be considered, particularly those with fat tails (Azevedo et al. 2018).
The method could be extended to more than two treatments, potentially allowing for correlated priors, e.g., for a holdout group versus several alternative marketing treatments. The cost of switching between treatments, which can be substantial for offline marketing treatments, could also be incorporated into the decision problem. If it is possible to deploy different treatments to sub-

Appendix A Normal-Normal Model Derivations
Proposition A.1 (Expected Roll Stage Profit). When the mean profit $\bar{y}_j$ is distributed $\bar{y}_j \sim N(m_j, s_j^2/n_j)$ with prior $m_j \sim N(\mu_j, \sigma_j^2)$, and when the decision rule picks the arm with the highest posterior mean, the expected profit in the roll stage is:
Proof. Denote the decision rule $\delta(y_1, y_2) = I(a_1 + b_1 \bar{y}_1 > a_2 + b_2 \bar{y}_2)$. The linear decision rule includes the optimal one that uses the posterior predictive distribution with $a_j = \frac{(s_j^2/n_j)\,\mu_j}{\sigma_j^2 + s_j^2/n_j}$ and $b_j = \frac{\sigma_j^2}{\sigma_j^2 + s_j^2/n_j}$. Denote the pdf of $\bar{y}_j$ as $f_j$ and its cdf as $F_j$. Denote the pdf of $m_j$ as $g_j$ and its cdf as $G_j$.
The expected value from the roll stage is: In the derivation, we will make multiple uses of the following identities: and: The expression (N − n 1 − n 2 ) can be taken out of the integrand. Continuing with the first additive in the integral (the second will be symmetric): The last equation uses y = y 1 −m 1 s 1 / √ n 1 as a change of variables.
Using identity (21), the final integral equals:
Plugging back into the expected value in (19), the expected value of the roll stage equals:
Using identity (21) again, the first additive equals:
where the last equation uses the change of variables $m = \frac{m_1 - \mu_1}{\sigma_1}$. Using identities (20) and (21), we obtain:
Using symmetry, the a priori expected value of the roll stage is:
Plugging in the posterior mean parameters for $a_j$ and $b_j$ (as they are optimal), the roll stage expected value in the fully asymmetric model is:
where in the text we set $e = \mu_1 - \mu_2$ and $v = \frac{\sigma_1^4}{\sigma_1^2 + s_1^2/n_1} + \frac{\sigma_2^4}{\sigma_2^2 + s_2^2/n_2}$ in Equation (17). Thus we have completed the proof for the asymmetric case.
To get the expression in (9) we plug-in µ 1 = µ 2 = µ, σ 1 = σ 2 = σ and s 1 = s 2 = s into the above expression. Proof. Using the fact that y j ∼ N (m j , s 2 /n j ) and because in the symmetric case the decision rule is to pick the treatment with the highest mean: P r(δ(y 1 , y 2 ) = 1|m 1 , m 2 ) = P r(y 1 − y 2 > 0|m 1 , If we denote m = m 1 − m 2 , then m 1 − m 2 has a prior N (0, 2σ 2 ). The expected error rate is therefore: Using the identity , we get the expression: n 1 n 2 n 1 + n 2 Proposition A.3 (Profit maximizing sample size). When the mean profits y j are distributed y j ∼ N (m j , s 2 /n j ) with prior m j ∼ N (µ, σ 2 ), the profit-maximizing sample size is: Proof. Because the priors are symmetric, the optimal sample sizes will be equal. Denote them as n = n 1 = n 2 .
Proposition A.4 (Regret). In the symmetric Normal-Normal model with a population size $N$:
1. The expected value of perfect information is $E[\Pi \mid PI] = N\left(\mu + \frac{\sigma}{\sqrt{\pi}}\right)$.
2. The regret of the profit-maximizing design is $O(\sqrt{N})$.
3. The regret from using a classic hypothesis test is $\Omega(N)$.
Proof. Perfect information allows the marketer to pick the treatment with the highest mean $m_j$ without testing, yielding an expected profit of $N \cdot E[\max(m_1, m_2)]$. Because both treatments come from the same prior $N(\mu, \sigma^2)$, the mean of the maximum of two i.i.d. Normal variables is $\mu + \frac{\sigma}{\sqrt{\pi}}$, proving the first item.
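The identity for the mean of the maximum of two i.i.d. Normal draws can be checked by simulation. The following is a minimal sketch; the function names, the seed, and the values of $\mu$ and $\sigma$ are illustrative choices, not taken from the paper.

```python
import numpy as np


def mean_of_max_closed_form(mu, sigma):
    """E[max(m1, m2)] for two i.i.d. N(mu, sigma^2) draws."""
    return mu + sigma / np.sqrt(np.pi)


def mean_of_max_simulated(mu, sigma, draws=1_000_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = np.random.default_rng(seed)
    m = rng.normal(mu, sigma, size=(draws, 2))  # one row per pair (m1, m2)
    return m.max(axis=1).mean()


mu, sigma = 0.68, 0.03  # hypothetical prior parameters
closed = mean_of_max_closed_form(mu, sigma)
simulated = mean_of_max_simulated(mu, sigma)
print(closed, simulated)
```

Multiplying either value by $N$ gives the expected profit under perfect information used in the regret calculations.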
To prove the second item, we calculate the regret from using the profit maximizing design: Using the inequality x for x > 0, and denoting $x = n^* \sigma^2 / s^2$, the first additive results in: ≤ N σ √ π 1 2 σ 2 n * /s 2 + 1 σ 2 n * /s 2 (44) , the denominator $2 n^* \sigma^2 / s^2$ is larger than $\frac{1}{2} \frac{\sigma}{s} \sqrt{N}$ when $N > 4 \frac{s^2}{\sigma^2}$. Hence, we can bound the first additive in the regret (42) from above by: To bound the second additive: The first inequality uses the fact that $s^2 / n^*$ is positive, while the second uses the fact that $n^* < \sqrt{N} \frac{s}{2\sigma}$, as shown in the main text.
Summing the two additives shows that the regret of the profit maximizing design is smaller than $\frac{3 s \sqrt{N}}{\sqrt{\pi}}$, proving the second item that the regret is $O(\sqrt{N})$.
To prove the third item, we plug in the sample size from (2) for $n$ in the regret formula: where the equality in (50) follows from plugging in the NHST sample size, denoting $z = z_{(1-\alpha)/2} + z_\beta$, and the last inequality follows from

B Derivations for Beta-Binomial model
Let the profit $y_{ij}$ from customer $i$ exposed to treatment arm $j$ be $v_j$ with probability $p_j$ and 0 with probability $1 - p_j$, and let $y_j = \sum_{i=1}^{n_j} y_{ij} / v_j$ be the number of conversions with treatment $j$, where $n_j$ is the number of individuals assigned to treatment $j$. We put a $\text{Beta}(\alpha, \beta)$ prior distribution on $p_j$ and denote its pdf as $f(\cdot)$.
Proposition B.1 (Beta-Binomial Expected Profit). If profit $y_{ij}$ from customer $i$ exposed to treatment $j$ is $v_j$ with probability $p_j$ and zero otherwise, with priors $p_j \sim \text{Beta}(\alpha, \beta)$:
1. The expected profit in the test stage is $(n_1 v_1 + n_2 v_2) \frac{\alpha}{\alpha + \beta}$.
2. The expected profit in the roll stage is:
Proof. To prove the first item, the expected profit in the test stage is:
$$\int_{p_1} \sum_{y_1 = 0}^{n_1} v_1 y_1 \Pr(y_1 \mid p_1) f(p_1)\, dp_1 + \int_{p_2} \sum_{y_2 = 0}^{n_2} v_2 y_2 \Pr(y_2 \mid p_2) f(p_2)\, dp_2$$
Because $\sum_{y_j = 0}^{n_j} y_j \Pr(y_j \mid p_j) = n_j p_j$, we have $\int_{p_j} \sum_{y_j = 0}^{n_j} y_j \Pr(y_j \mid p_j) f(p_j)\, dp_j = n_j \frac{\alpha}{\alpha + \beta}$, and plugging this in yields the expression in the proposition.
To prove the second item, the a priori expected profit in the roll stage is: Focusing on the first additive (the second will be symmetric because of the symmetric prior), it can be written as: $\delta(y_1, y_2) \Pr(y_2 \mid p_2) \Pr(y_1 \mid p_1)\, p_1 f(p_1) f(p_2)\, dp_1\, dp_2$ The optimal decision rule $\delta(y_1, y_2)$ is to pick the treatment with the highest expected posterior profit $v_j E[p_j \mid y_j] = v_j \frac{\alpha + y_j}{\alpha + \beta + n_j}$, resulting from the fact that the profits are Binomially distributed with a Beta prior. Hence, by denoting ỹ 1 = α v 2 v 1 α+β+n 1 α+β+n 1 α+β+n 2 , and by applying Fubini's theorem, we can rewrite (55) as: The derivation above assumes that if the expected posterior profit of both treatments is equal, then treatment 1 is chosen as a tie-breaking rule. We will show that the result does not change under another tie-breaking rule (e.g., pick treatment 2 if tied, or pick one randomly).
We can calculate $\Pr(y_j)$ as: Plugging into (56), the total roll stage profit is: If there is a tie such that $v_1 \frac{\alpha + y_1}{\alpha + \beta + n_1} = v_2 \frac{\alpha + y_2}{\alpha + \beta + n_2}$, it does not matter whether we take the left or the right additive within the parentheses. Hence, any tie-breaking rule will yield an equivalent profit.
To design a test for a binomial experiment, the expected profit from Proposition B.1 can be numerically optimized using a discrete optimization heuristic.
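One way such a numerical search might look is sketched below. It is not the paper's heuristic: it evaluates the expected profit by direct enumeration over all test outcomes $(y_1, y_2)$, using the Beta-Binomial marginal for $\Pr(y_j)$ and the posterior-mean decision rule from the proof, and then runs a plain grid search over equal group sizes. The function name `expected_profit_bb`, the grid, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import betabinom


def expected_profit_bb(n1, n2, N, v1, v2, alpha, beta):
    """Expected test + roll profit under Beta(alpha, beta) priors on both arms."""
    prior_mean = alpha / (alpha + beta)
    test_profit = (n1 * v1 + n2 * v2) * prior_mean

    y1 = np.arange(n1 + 1)
    y2 = np.arange(n2 + 1)
    # Marginal (prior predictive) probability of each conversion count.
    pr_y1 = betabinom.pmf(y1, n1, alpha, beta)
    pr_y2 = betabinom.pmf(y2, n2, alpha, beta)
    # Expected posterior profit per customer for each arm and outcome.
    post1 = v1 * (alpha + y1) / (alpha + beta + n1)
    post2 = v2 * (alpha + y2) / (alpha + beta + n2)
    # Deploy the arm with the higher posterior profit; ties give the same value.
    chosen = np.maximum(post1[:, None], post2[None, :])
    roll_profit = (N - n1 - n2) * np.sum(pr_y1[:, None] * pr_y2[None, :] * chosen)
    return test_profit + roll_profit


# Hypothetical inputs: 10,000 customers, unit value per conversion, Beta(2, 18) prior.
N, v, alpha, beta = 10_000, 1.0, 2.0, 18.0
candidates = list(range(10, 501, 10))  # coarse grid over equal group sizes
profits = [expected_profit_bb(n, n, N, v, v, alpha, beta) for n in candidates]
best_n = candidates[int(np.argmax(profits))]
best_profit = max(profits)
no_test_profit = N * v * alpha / (alpha + beta)
print(best_n, best_profit, no_test_profit)
```

The enumeration costs on the order of $n_1 n_2$ operations per evaluation, so a coarse grid followed by a local refinement is a reasonable compromise for larger populations. Unequal group sizes can be searched the same way by passing different `n1` and `n2`.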

C.1 Incumbent Challenger Test
In an incumbent/challenger test, more is known about one treatment than the other. Denote $\sigma_2 = c\sigma_1$ with $c > 1$. To analyze this scenario in closed form, we will assume that $\mu_1 = \mu_2$ and that $s_1 = s_2 = s$, although the solution can be found numerically for any set of values. Because the uncertainty is larger for treatment 2, it is always the case that $n_2^* > n_1^*$ in an incumbent/challenger test. When the population size is small enough, it is too wasteful to experiment with treatment 1, and the test will only include exposures to treatment 2. After this test phase, a comparison is made to the prior on treatment 1 to select which treatment to deploy. This is shown formally in Proposition C.1:
Proposition C.1 (Incumbent/Challenger sample sizes). In an asymmetric test where treatment 1 is an incumbent and treatment 2 is a challenger such that $\mu_1 = \mu_2$, $s_1 = s_2 = s$ and $\sigma_2 = c \cdot \sigma_1$ with $c > 1$:
1. The optimal sample sizes are:
$$n_1^* = \frac{s \left[ \sqrt{2c^2(c^2 + 1) N \sigma_1^2 + (2c^4 + 5c^2 + 2) s^2} - c s (1 + 2c^2) \right]}{2 (c^3 + c) \sigma_1^2} \qquad (61)$$
$$n_2^* = \frac{s \left[ c \sqrt{2c^2(c^2 + 1) N \sigma_1^2 + (2c^4 + 5c^2 + 2) s^2} - (c^2 + 2) s \right]}{2c^2 (c^2 + 1) \sigma_1^2} \qquad (62)$$
2. $n_2^* > n_1^*$ for any value of $N$, $s$, $c > 1$ and $\sigma_1$.
To prove the second item, the inequality $n_2^* - n_1^* > 0$ can be written as: which always holds because $c > 1$.
To prove the third item, we solve for $n_2^* > 0$, which holds for the described parameter values, and $n_1^* > 0$, which holds if and only if $N > \frac{(2c^4 - c^2 - 1) s^2}{c^2 \sigma_1^2}$.
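The incumbent/challenger sample sizes can be evaluated directly. The sketch below assumes that (61) and (62) read $n_1^* = s[\sqrt{A} - cs(1 + 2c^2)] / (2(c^3 + c)\sigma_1^2)$ and $n_2^* = s[c\sqrt{A} - (c^2 + 2)s] / (2c^2(c^2 + 1)\sigma_1^2)$ with $A = 2c^2(c^2 + 1)N\sigma_1^2 + (2c^4 + 5c^2 + 2)s^2$. This reading is an assumption about equations that are hard to make out in this text, although it is consistent with the population threshold stated in the proof of the third item. Function names and parameter values are hypothetical.

```python
import math


def incumbent_challenger_sizes(N, s, sigma1, c):
    """Unconstrained optimal sizes (n1 for the incumbent, n2 for the challenger)."""
    root = math.sqrt(2 * c**2 * (c**2 + 1) * N * sigma1**2
                     + (2 * c**4 + 5 * c**2 + 2) * s**2)
    n1 = s * (root - c * s * (1 + 2 * c**2)) / (2 * (c**3 + c) * sigma1**2)
    n2 = s * (c * root - (c**2 + 2) * s) / (2 * c**2 * (c**2 + 1) * sigma1**2)
    return n1, n2


def min_population_for_incumbent(s, sigma1, c):
    """Population size above which the formula gives n1 > 0."""
    return (2 * c**4 - c**2 - 1) * s**2 / (c**2 * sigma1**2)


# Hypothetical inputs: challenger prior standard deviation is twice the incumbent's.
N, s, sigma1, c = 100_000, 0.466, 0.03, 2.0
n1, n2 = incumbent_challenger_sizes(N, s, sigma1, c)
threshold = min_population_for_incumbent(s, sigma1, c)
print(n1, n2, threshold)
```

A negative `n1` signals that the population is below the threshold, in which case only the challenger should be tested and its sample size would need to be re-optimized with $n_1 = 0$ rather than read off the formula.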

C.2 Pricing Test
Suppose the firm would like to pick between two known prices, p 1 and p 2 , and that demand from customer i presented with price j is d ij = a − m · p j + ε ij . In this model, demand is linear in price, a is a maximum willingness to pay for the product, m is the uncertain price sensitivity with a prior distribution N (µ, σ 2 ), and ε ij ∼ N (0, s 2 ). The profit from a customer i presented with price j will be y ij = p j d ij . This model is an instance of the asymmetric model, when we denote µ j = p j (a − µp j ), σ j = p j σ and s j = p j s. Consequently, the profit and sample size formulas derived for the asymmetric case can be applied directly to pricing experiments, and will recommend different sample sizes depending on the levels of prices being tested.

D Thompson Sampling for the Normal-Normal model
Thompson sampling (Thompson 1933) has recently become the prominent heuristic for implementing multi-armed bandits, due to its superior performance and ease of implementation (Scott 2010, Schwartz et al. 2017). Here we describe the Thompson sampling algorithm we use, which is the standard implementation applied to the normal symmetric model.
Opportunities to apply the treatment are assumed to come in one at a time for each i = 1 . . . N .
Under the symmetric normal model, treatment $j$ generates outcomes $y_{ji}$ drawn from $N(m_j, s^2)$.
3. Either $y_{1i}$ or $y_{2i}$ is observed based on the decision. In simulation, $y_{ji}$ is drawn from its true distribution $N(m_j, s^2)$.
4. The hyperparameters $\mu_j(i)$ and $\sigma_j^2(i)$ are updated given the new data. If treatment $j$ was not deployed, the hyperparameters at time $i$ equal those at time $i - 1$. If the treatment was deployed, the hyperparameters are calculated as the posterior of the normal distribution, with the observed outcome used as data and the hyperparameters from period $i - 1$ used for the prior.
Thus, treatments are probabilistically sampled according to the current probability that each treatment is best, i.e., treatment 1 is sampled at the rate of $\Pr(\mu_1(i) > \mu_2(i))$. This rule favors treatments with higher expected response and, as a result, the algorithm will quickly converge to the best-performing treatment as data accumulates. However, it is also more likely to sample treatments with higher uncertainty, because of the high potential upside for those treatments, which helps to avoid converging to the wrong treatment.
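The loop described above can be sketched as follows. The first two steps are not spelled out in this text, so the sketch assumes the standard Thompson step: draw one value from each arm's current posterior and deploy the arm with the larger draw. The function name `thompson_normal`, the seed, and all parameter values are hypothetical.

```python
import numpy as np


def thompson_normal(m_true, s, mu0, sigma0, N, seed=0):
    """Two-armed Thompson sampling with Normal outcomes and Normal priors."""
    rng = np.random.default_rng(seed)
    mu = np.full(2, mu0, dtype=float)          # posterior means mu_j(i)
    var = np.full(2, sigma0**2, dtype=float)   # posterior variances sigma_j^2(i)
    pulls = np.zeros(2, dtype=int)
    total = 0.0
    for _ in range(N):
        # Assumed steps 1-2: sample each posterior, deploy the larger draw.
        draws = rng.normal(mu, np.sqrt(var))
        j = int(np.argmax(draws))
        # Step 3: observe the outcome of the deployed arm only.
        y = rng.normal(m_true[j], s)
        # Step 4: conjugate Normal update for arm j; the other arm is unchanged.
        precision = 1.0 / var[j] + 1.0 / s**2
        mu[j] = (mu[j] / var[j] + y / s**2) / precision
        var[j] = 1.0 / precision
        pulls[j] += 1
        total += y
    return pulls, mu, var, total


# Hypothetical simulation: arm 0 is truly better by one outcome standard deviation.
m_true, s, mu0, sigma0, N = (1.0, 0.0), 1.0, 0.5, 1.0, 2000
pulls, mu_post, var_post, total = thompson_normal(m_true, s, mu0, sigma0, N)
print(pulls, mu_post, var_post, total)
```

Because each draw comes from the current posterior, an arm is deployed with exactly the posterior probability that it is the better one, which is the sampling rate described in the text.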
The explicit explore versus exploit trade-off in a multi-armed bandit is similar to the trade-off between the size of the test sample and the remaining population in a test & roll, albeit more dynamic. The dynamic approach works better when opportunities to apply the treatment are spread out over time and the desired response is immediately available (e.g., website tests where the response is a click), but can be difficult to execute when the response is not immediately observable (e.g., sales) or when the treatments are sent out in batches (e.g., direct mail). Agrawal and Goyal (2013) have shown that the regret from Thompson sampling with Normal outcomes and Normal priors is $O(\sqrt{N})$. This has been shown before to be the best achievable regret for any dynamic multi-armed bandit approach when compared to having perfect information, and hence Thompson sampling is an ideal benchmark for comparison.

E Application Details
If a firm has data on prior marketing treatments that are similar to those that will be tested, this data can be used to estimate the distribution of mean response needed to compute the test & roll sample size using (10). For example, if the firm has past data on response $y_{ij}$ for each customer $i$ in each of $j = 1, \ldots, J$ previous marketing campaigns, then we can fit a hierarchical model: where $\mu$ and $\sigma$ are the parameters of the distribution of mean response and can be plugged into (10). The campaigns $j$ can be defined by a particular period of time when a marketing treatment was in place and the response was stable, such as response rates to direct marketing campaigns or customers visiting a website in a particular month. The key assumption is that these prior campaigns represent the range of likely mean responses for the treatments in the test that is being planned. We provide more details for specific applications below.

E.1 Website Testing Example
The data on website tests is adopted from Berman et al. (2018). Because the experiments were conducted across many websites with a wide range of click rates, there tends to be correlation in the click rate between the two arms in the same experiment. To account for this, we assume that each experiment $k$ has its own mean click rate $t_k$ and assume that the means for the treatment arms within the experiment are normally distributed around the click rate for the experiment as follows: Because the data is binary, we follow the binomial approximation and assume $s = \sqrt{m_{1k}(1 - m_{1k})}$, reducing the number of estimated population parameters to three. The model is estimated using the HMC algorithm implemented in Stan (Stan Development Team 2018) with diffuse priors on the hyper-parameters, and the estimates are reported in Table 4.
In the empirical model, ω captures the variation in mean response across experiments, while σ captures the variation between arms within an experiment. In sizing a test & roll experiment following (10), we are interested in the potential differences between arms within a single experi-

E.2 Display Ad Testing Example
We illustrate how "Advertiser 1" in Lewis and Rao (2015) might obtain the parameters µ and σ in order to find the profit-maximizing sample size for a new test & roll with treatments that are expected to perform similarly to experiments 1.1, 1.2, 1.3, 1.5 and 1.6 reported in Table 5. We eliminated experiment 1.4 because it had a substantially different media cost and response rate for the control group versus the other experiments and appears to be targeting customers with higher baseline purchase propensity. Using the data in Table 5, we estimate the following hierarchical model for the mean response in the control group reported for each experiment j.
$$\bar{y}_j \sim N\left(m_j, \frac{\hat{s}}{\sqrt{n}}\right) \qquad (71)$$
$$m_j \sim N(\mu, \sigma)$$
where the sampling distribution for $y_{ijk}$ in equation 69 has been replaced with $\bar{y}_j$, since we do not have access to the user-level data. The estimates of $\mu$ and $\sigma$ reported in Table 6 are used in designing a new test & roll for Advertiser 1. $s$ is estimated as the average of $s_j$ across the 5 experiments, which is 103.77.
Because we are estimating the variance in mean response $\sigma$ from just 5 experiments, the posterior of $\sigma$ is relatively wide. As can be seen from (10), the profit-maximizing sample size will be largest when $\sigma$ is smallest. Conservatively, one might use the posterior 2.5th percentile for $\sigma$ instead of the posterior mean. This results in a profit-maximizing sample size of 18,486, still far smaller than that recommended for a hypothesis test.

E.3 Catalog Holdout Testing Example
Individually, the catalog holdout tests have very imprecise estimates for response due to small sample size and high noise in the data. The hierarchical model is particularly valuable in pooling information across the tests and propagating uncertainty due to small sample sizes. We fit a model similar to that used for the website tests, except that we allow for $\mu_1 \neq \mu_2$ and $\sigma_1 \neq \sigma_2$, because unlike for the website tests, there is a clear distinction between the treated and holdout conditions.
The model we fit is:
$$y_{i1k} \sim N(m_{1k}, s_1) \quad \text{for customers in the control group} \qquad (73)$$
$$y_{i2k} \sim N(m_{2k}, s_2) \quad \text{for customers in the treatment group} \qquad (74)$$
$$m_{1k} \sim N(t_k, \sigma_1) \qquad (75)$$
$$m_{2k} \sim N(t_k + \Delta, \sigma_2) \qquad (76)$$
By modeling the overall response rate for the experiment, $t_k$, we allow for the different targeted populations to have different response rates and account for the correlation in response within experiments. In planning a new test, we focus on the variation in response rates within the experiment, as estimated by $\sigma_1$ and $\sigma_2$.
Samples from the posterior are obtained using the HMC algorithm implemented in Stan with uniform priors on the hyper-parameters. The posterior means for $\mu_1$, $\Delta = \mu_2 - \mu_1$, $\sigma_1$ and $\sigma_2$ reported in Table 7 are used as point estimates to compute the asymmetric test & roll sample size. We also estimated a version of the model where $s_1$ was constrained to be the same as $s_2$ and used these estimates to show that unequal group sizes can arise from the priors (unlike in null hypothesis testing). The resulting estimates are reported in Table 8.