The fixed-sample, sequential, and Bayesian methods all start from the same place: estimating the mean ($\hat{\Delta}$) and variance ($\hat{\sigma}^2_{\Delta}$) of the lift. The former is an estimate of the size of the treatment effect (as a percentage of the baseline metric value); the latter indicates how reliable that estimate is for predicting what will happen if you ship the treatment.
Once we have an estimate of the observed lift (both its mean and variance), we can use that to construct a confidence interval that describes the plausible values for the true lift.
First, some notation:
For a given metric, we observe $X_i$ for each subject $i$ in the control group of size $n_C$, and $Y_j$ for each subject $j$ in the treatment group of size $n_T$. The set of all observations in the control group is $X$, and for the treatment group it is $Y$.
The true population mean (which we don't know!) of the metric among the control group is $\mu_C$, and among the treatment group it is $\mu_T$. Similarly, the true (unobserved) population variances are $\sigma^2_C$ and $\sigma^2_T$.
$\bar{X}$ and $\bar{Y}$ are the averages of $X$ and $Y$ across all subjects in the control group and treatment group, respectively:

$$\bar{X} = \frac{1}{n_C}\sum_{i=1}^{n_C} X_i \qquad \bar{Y} = \frac{1}{n_T}\sum_{j=1}^{n_T} Y_j$$
$s^2_C$ and $s^2_T$ are the sample variances of $X$ and $Y$ for the control group and treatment group, respectively:

$$s^2_C = \frac{1}{n_C - 1}\sum_{i=1}^{n_C}\left(X_i - \bar{X}\right)^2 \qquad s^2_T = \frac{1}{n_T - 1}\sum_{j=1}^{n_T}\left(Y_j - \bar{Y}\right)^2$$
Regardless of the distributions of $X$ and $Y$, the Central Limit Theorem (CLT) says that, when $n_C$ and $n_T$ are sufficiently large, the means of $X$ and $Y$ are each approximately normally distributed; that is:

$$\bar{X} \sim \mathcal{N}\!\left(\mu_C,\ \frac{\sigma^2_C}{n_C}\right) \qquad \bar{Y} \sim \mathcal{N}\!\left(\mu_T,\ \frac{\sigma^2_T}{n_T}\right)$$
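As a quick sketch of these building blocks (with made-up data; the variable names are ours, not Eppo's), the sample moments and the CLT variance of each group's mean can be computed directly:

```python
import statistics

# Hypothetical per-subject metric values for each group (illustrative data).
control = [9.8, 10.1, 10.4, 9.6, 10.2, 9.9, 10.0, 10.3]
treatment = [10.5, 10.9, 10.2, 10.7, 11.0, 10.4, 10.6, 10.8]

def moments(xs):
    """Return the sample mean, the (unbiased) sample variance,
    and the CLT variance of the sample mean, s^2 / n."""
    n = len(xs)
    mean = statistics.fmean(xs)
    var = statistics.variance(xs)
    return mean, var, var / n

mean_c, var_c, var_mean_c = moments(control)
mean_t, var_t, var_mean_t = moments(treatment)
```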
Eppo also supports ratio metrics, where instead of estimating the mean of a single metric, we are interested in the ratio of two means (or, equivalently, the ratio of two sums), such as average order value or time per session. In this case (leaving out the $C$/$T$ subscript for brevity), we observe from a sample of size $n$ the numerator metric $N_i$ and the denominator metric $D_i$ for each subject (with the sets of all subject observations being $N$ and $D$, respectively) and are trying to estimate:

$$R = \frac{\mu_N}{\mu_D}$$
As above, $\mu_N$ and $\mu_D$ can be estimated by their sample means $\bar{N}$ and $\bar{D}$, which are normally distributed around the true values with variances $\sigma^2_N/n$ and $\sigma^2_D/n$, which in turn can be estimated using the sample variances $s^2_N$ and $s^2_D$.
Estimating the ratio then becomes a matter of dividing the distributions for each component of the ratio; under some reasonable assumptions1, the resulting quotient is also approximately normally distributed:

$$\frac{\bar{N}}{\bar{D}} \sim \mathcal{N}\!\left(\frac{\mu_N}{\mu_D},\ \sigma^2_R\right)$$
The variance term $\sigma^2_R$ can be calculated using the delta method (with $s_{ND}$ representing the sample covariance between $N$ and $D$) as:

$$\sigma^2_R = \frac{1}{\bar{D}^2}\frac{s^2_N}{n} + \frac{\bar{N}^2}{\bar{D}^4}\frac{s^2_D}{n} - \frac{2\bar{N}}{\bar{D}^3}\frac{s_{ND}}{n}$$

Note that each term is divided by an extra $n$ because we need the variances and covariance of the sample averages (e.g., $\operatorname{Cov}(\bar{N}, \bar{D}) = s_{ND}/n$), not of the individual observations.
For each variation, then, we can plug in the sample moments of the numerator and denominator metrics to calculate $\bar{N}/\bar{D}$ and $\sigma^2_R$ for the ratio.
The analysis below then simply uses these values for ratio metrics in place of the simple sample means and variances for each variation.
We want to calculate the lift, which we'll call $\Delta$:

$$\Delta = \frac{\mu_T - \mu_C}{\mu_C} = \frac{\mu_T}{\mu_C} - 1$$
But, since we don't know the true values $\mu_C$ and $\mu_T$, we'll need to estimate the lift instead. We know from the CLT, as shown in equation 3 above, that $\bar{X}$ and $\bar{Y}$ are approximately normally distributed (for sufficiently large $n_C$ and $n_T$); furthermore, since $\bar{X}$ and $\bar{Y}$ are independent, under reasonable assumptions the ratio $\bar{Y}/\bar{X}$ is approximately normal.2 This allows us to model $\Delta$ as a normal distribution:

$$\hat{\Delta} \sim \mathcal{N}\!\left(\frac{\bar{Y}}{\bar{X}} - 1,\ \hat{\sigma}^2_{\Delta}\right), \qquad \hat{\sigma}^2_{\Delta} = \frac{1}{\bar{X}^2}\frac{s^2_T}{n_T} + \frac{\bar{Y}^2}{\bar{X}^4}\frac{s^2_C}{n_C}$$
Note that the calculation of the variance relies on the delta method.
Thus, we have estimated the lift as being normally distributed with a mean of $\hat{\Delta}$ and a variance of $\hat{\sigma}^2_{\Delta}$. For the frequentist methods (fixed-sample and sequential), that's where we end, in terms of estimating the lift (see below for turning that estimate into confidence intervals). For Bayesian, however, there's one more step.
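A minimal sketch of the lift estimate, using made-up data for the two groups (since the groups are independent, the delta-method variance has no covariance term):

```python
import statistics

# Hypothetical per-subject metric values (illustrative data).
control = [9.8, 10.1, 10.4, 9.6, 10.2, 9.9, 10.0, 10.3]
treatment = [10.5, 10.9, 10.2, 10.7, 11.0, 10.4, 10.6, 10.8]

n_c, n_t = len(control), len(treatment)
mean_c, mean_t = statistics.fmean(control), statistics.fmean(treatment)
var_c, var_t = statistics.variance(control), statistics.variance(treatment)

# Point estimate of the relative lift.
lift = mean_t / mean_c - 1

# Delta-method variance of the lift; the groups are independent,
# so no covariance term appears.
var_lift = (var_t / n_t) / mean_c**2 + mean_t**2 * (var_c / n_c) / mean_c**4
```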
If you are using CUPED, then the estimate of the lift will be a bit more complicated; we still model the lift as a normal distribution, but the mean and variance are computed after using a ridge regression to account for random (that is, not correlated with the treatment assignment) pre-experiment differences between the groups. See the CUPED docs for more information.
In a Bayesian framework, you start with a prior distribution, which describes what you believe before running the experiment. Then, you run the experiment and collect data, which you use to update your prior: in essence, you combine your pre-experiment beliefs about what the lift would be, with the evidence you've gotten from the experiment, into a new set of beliefs, called the posterior (because it comes after gathering data). The estimated average lift is then just the mean of this posterior distribution.
Setting the prior
In our implementation of the Bayesian approach, we use a normal distribution3 as our prior for the lift:

$$\Delta \sim \mathcal{N}\!\left(0,\ \sigma^2_0\right)$$
In other words, our prior is that the lift, on average, will be zero, and that for each metric, about 50% of experiments will show a lift between -21% and +21%, and about 95% of experiments will show a lift between -62% and +62%; from our experience running experiments, this is a fairly conservative prior, as having lifts over ±50% is extremely rare.
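The quantiles above imply a prior standard deviation of roughly $\sigma_0 \approx 0.316$ (variance $\approx 0.1$); that value is inferred here from the quoted numbers, not taken from Eppo's implementation, but it is easy to check that it reproduces them:

```python
from statistics import NormalDist

# Prior standard deviation back-solved from the quoted quantiles
# (an inferred value, used for illustration only).
prior = NormalDist(mu=0.0, sigma=0.316)

central_50 = prior.cdf(0.21) - prior.cdf(-0.21)   # share of mass in ±21%
central_95 = prior.cdf(0.62) - prior.cdf(-0.62)   # share of mass in ±62%
```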
Updating the prior
The evidence from the experiment is construed as a normal distribution just as with the frequentist methods (see equations 2–4 above). However, for the Bayesian method we use this evidence to update the above prior, and the result is our posterior.
Specifically, our posterior is a normal distribution with mean $\mu_{\mathrm{post}}$ and variance $\sigma^2_{\mathrm{post}}$, where:6

$$\sigma^2_{\mathrm{post}} = \left(\frac{1}{\sigma^2_0} + \frac{1}{\hat{\sigma}^2_{\Delta}}\right)^{-1} \qquad \mu_{\mathrm{post}} = \sigma^2_{\mathrm{post}}\left(\frac{\mu_0}{\sigma^2_0} + \frac{\hat{\Delta}}{\hat{\sigma}^2_{\Delta}}\right)$$
In other words, our posterior mean is the weighted average of the prior and the observed data, where for each term the weight is the precision (that is, the inverse of the variance). The variance, meanwhile, is related to the harmonic mean of the variances of the prior and the observed data.
We can also rewrite4 the posterior mean (equation 6) as:

$$\mu_{\mathrm{post}} = \hat{\Delta} - \frac{\hat{\sigma}^2_{\Delta}}{\hat{\sigma}^2_{\Delta} + \sigma^2_0}\left(\hat{\Delta} - \mu_0\right)$$
Note that $\frac{\hat{\sigma}^2_{\Delta}}{\hat{\sigma}^2_{\Delta} + \sigma^2_0}$ reflects how spread out the data are, relative to the prior,5 and $\hat{\Delta} - \mu_0$ is the distance between what we've observed and our prior expectation; thus, we can interpret equation 8 as showing that our posterior lift is the lift we observed in the experiment shrunk toward the prior (that is, toward 0), and that the shrinkage will be larger if our data are noisy (as happens when we have few observations) and/or our prior is very strong (that is, $\sigma^2_0$ is low).
Arriving at the posterior
Since our prior is that the lift is zero (that is, $\mu_0 = 0$), we can simplify the statement of our posterior (equations 7 and 8) to:

$$\sigma^2_{\mathrm{post}} = \left(\frac{1}{\sigma^2_0} + \frac{1}{\hat{\sigma}^2_{\Delta}}\right)^{-1} \qquad \mu_{\mathrm{post}} = \frac{\hat{\Delta}}{1 + \hat{\sigma}^2_{\Delta}/\sigma^2_0}$$
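The update itself is a short normal-normal conjugate calculation. The numbers below (a 4% observed lift with a 2% standard error, and a prior variance of 0.1) are illustrative assumptions, not Eppo's internals:

```python
# Normal-normal conjugate update with a zero-mean prior.
prior_var = 0.1                 # assumed prior variance (illustrative)
lift_hat, lift_var = 0.04, 0.02**2  # assumed observed lift and its variance

# Posterior variance: inverse of the summed precisions.
post_var = 1.0 / (1.0 / prior_var + 1.0 / lift_var)

# Posterior mean: precision-weighted average (prior mean is zero).
post_mean = post_var * (lift_hat / lift_var)

# Equivalently, the observed lift shrunk toward zero.
shrinkage = lift_var / (lift_var + prior_var)
assert abs(post_mean - (1 - shrinkage) * lift_hat) < 1e-12
```

Because the data here are much more precise than the prior, the shrinkage is small and the posterior mean sits just below the observed 4% lift.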
Now that we have (normal) distributions describing our estimate of the lift, that is, an estimate for the mean and variance of the lift, we can construct confidence intervals. The width of those intervals will depend on the confidence level, $\alpha$, which represents the desired likelihood that the interval includes the true lift.
For frequentist methods, we want to ensure that our confidence interval contains the true lift with some minimum probability (our confidence level).7 Specifically, after $n$ observations (where $n = n_C + n_T$ and can be thought of as a function of time), we want to set the lower and upper bounds $L_n$ and $U_n$ of our confidence interval such that:8

$$\Pr\left(L_n \leq \Delta \leq U_n\right) \geq \alpha$$
Fixed-sample analysis assumes that the results are looked at only once, and therefore only a single interval need be constructed, so we can ignore the subscripts $n$ in the above constraint (eq. 15). Given that we have constructed a normal distribution that describes the lift estimate, we can simply use that distribution's quantile function $Q(p) = \hat{\Delta} + \hat{\sigma}_{\Delta}\,\Phi^{-1}(p)$, where $\Phi^{-1}$ is the quantile function for the standard normal distribution and $p$ is the desired quantile. Specifically, this gives us lower and upper bounds for a given confidence level $\alpha$ (e.g., $\alpha = 0.95$ for a 95% confidence interval):

$$L = \hat{\Delta} + \hat{\sigma}_{\Delta}\,\Phi^{-1}\!\left(\frac{1 - \alpha}{2}\right) \qquad U = \hat{\Delta} + \hat{\sigma}_{\Delta}\,\Phi^{-1}\!\left(\frac{1 + \alpha}{2}\right)$$
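For example, a fixed-sample interval can be computed directly from the standard normal quantile function (the lift and standard error below are made-up inputs):

```python
from statistics import NormalDist

def fixed_sample_ci(lift_hat, se, alpha=0.95):
    """Two-sided fixed-sample confidence interval at confidence level alpha."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # e.g., ~1.96 for alpha = 0.95
    return lift_hat - z * se, lift_hat + z * se

lo, hi = fixed_sample_ci(lift_hat=0.04, se=0.02)
```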
For the sequential analysis method, we need to ensure that the constraint in equation 15 holds for all $n$ at once. That is:

$$\Pr\left(L_n \leq \Delta \leq U_n \ \text{for all}\ n\right) \geq \alpha$$
This means that we do not have a single confidence interval, but rather a confidence sequence: an infinite sequence of confidence intervals such that, not only is each individual interval valid for controlling the rate of false positives, but the aggregation of all intervals is valid as well. In other words, a confidence sequence provides statistical guarantees while allowing you to peek at the results as many times as you want (including after every single observation) and to stop the experiment at any time.
The method we use for constructing the bounds $L_n$ and $U_n$ comes from Howard et al., “Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences”, and has the following useful properties:
- The statistical guarantees are valid even under very broad assumptions about the underlying distribution of the metric.
- You do not need to predetermine the sample size.
- You can peek at results any number of times, and can decide, based on what you see, to shut down the experiment or keep it running to collect more data.
- You can use any stopping rule you like, meaning, for example, you can change the confidence level in the middle of the experiment.
- As you collect more data, the width of the intervals will tend to get smaller and smaller, and eventually will approach zero.
- Although the confidence intervals are wider than for fixed-sample analysis, the penalty incurred for the additional flexibility and generality is smaller than that from a host of previous methods.
We use a slightly modified version of equation 14 from the reference, with some changes in notation, to construct our bounds.9 Specifically, using the estimated lift $\hat{\Delta}$ from equation 3, the estimated standard error of the lift $\hat{\sigma}_{\Delta}$ from equation 4, the confidence level $\alpha$, and the total sample size $n$:

$$L_n,\ U_n = \hat{\Delta} \mp \hat{\sigma}_{\Delta}\sqrt{\frac{n + \rho}{n}\log\left(\frac{n + \rho}{\rho\,(1 - \alpha)^2}\right)}$$
where $\rho$ is a tuning parameter that determines where the ratio between the sequential confidence interval width and the fixed-sample confidence interval width is minimized; we set $\rho$ to try to minimize the cost (in terms of additional samples needed) of the sequential method around typical A/B test sample sizes.
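As a rough sketch (not Eppo's exact formula or tuning), a normal-mixture-style boundary of this shape can be computed as follows; the value of `rho` here is purely illustrative:

```python
import math

def sequential_bounds(lift_hat, se, n, alpha=0.95, rho=1000.0):
    """Two-sided normal-mixture-style boundary in the spirit of
    Howard et al., rescaled around the mean lift estimate.
    rho is the tuning parameter; 1000.0 is an illustrative choice."""
    miss = 1.0 - alpha  # allowed probability of ever crossing the bounds
    radius = se * math.sqrt(
        (n + rho) / n * math.log((n + rho) / (rho * miss**2))
    )
    return lift_hat - radius, lift_hat + radius

lo, hi = sequential_bounds(lift_hat=0.04, se=0.02, n=10_000)
```

Note that at the same standard error the interval is wider than the fixed-sample one; that is the price of being able to peek continuously.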
We have run extensive simulations to validate that these confidence intervals satisfy the specified coverage guarantees.
For simplicity, when referring to the lower and upper bounds around a lift estimate, we generally use the phrase "confidence interval" regardless of which analysis method you might be using; in a Bayesian context, however, the term is a misnomer, as Bayesian methods approach the problem of inference differently than frequentist methods do. Bayesians therefore use credible intervals instead.
Although the epistemological underpinnings for constructing a Bayesian credible interval may share little with the process for constructing a fixed-sample confidence interval, the statistical procedure is nearly identical. That is, given our posterior distribution from equation 9, we simply look at the quantile function for that distribution to set our bounds:

$$L = \mu_{\mathrm{post}} + \sigma_{\mathrm{post}}\,\Phi^{-1}\!\left(\frac{1 - \alpha}{2}\right) \qquad U = \mu_{\mathrm{post}} + \sigma_{\mathrm{post}}\,\Phi^{-1}\!\left(\frac{1 + \alpha}{2}\right)$$
It is important to note that the constraint in equation 12 (ensuring that the probability that the true lift falls within the bounds is at least as high as our confidence level) does not hold in the Bayesian case, simply because such a constraint is nonsensical in a Bayesian context. Instead, the above bounds in equation 15 (which constitute a credible interval) describe our beliefs about what the lift might plausibly be given our prior and the observed data. In particular, we can say that we expect $L \leq \Delta \leq U$ with probability $\alpha$, given our prior and the observed data.
- Under reasonable assumptions, we can approximate the ratio of two normal distributions as a normal distribution centered on the ratio of the means. In essence, the approximation requires that the denominator be unlikely to be negative. Since all metrics are positive, the requirement boils down to whether the distribution of the denominator is sufficiently narrow (that is, has sufficiently low variance, relative to the mean). There is a short treatment of this approximation here, and a longer treatment in Díaz-Francés and Rubio (2004), "On the Existence of a Normal Approximation to the Distribution of the Ratio of Two Independent Normal Random Variables."↩
- For more on requirements for this approximation, see note above. In this case, the denominator (which must be unlikely to be negative for the approximation to hold) is the distribution of the treatment metric.↩
- We use a normal distribution because it is a convenient conjugate prior, meaning that we can update it with our (normally distributed) lift estimate and produce another normal distribution. In this case, we are assuming that the variance of the lift is known, that is, that $\hat{\sigma}^2_{\Delta}$ is accurate. The choice of a wide prior is, in part, designed to compensate for this assumption. Evaluating and improving upon our choice of prior is an area of ongoing research.↩
- For a derivation, see Gelman et al., “Bayesian Data Analysis Third Edition” (2020), §2.5. An alternative derivation is provided in Murphy, “Conjugate Bayesian Analysis of the Gaussian Distribution” (2007). ↩
- This section follows Gelman et al., “Bayesian Data Analysis Third Edition” (2020), p. 40, with some slight tweaks to notation and ordering.↩
- In particular, if the sample variance of the data goes to zero (as would happen if our sample size gets very large), so will this quotient, meaning that the prior will have less and less of an effect on the posterior. Furthermore, it will get to zero faster, as we add samples, if our prior is weaker—that is, if it has a higher variance. On the other hand, if we have a strong prior belief, represented by a low value of $\sigma^2_0$, then moving this term toward zero requires more data (or data that doesn't vary much).↩
- This is equivalent to saying that we want to limit the false positive rate to be no more than $1 - \alpha$, which is how this constraint is typically framed in the context of null hypothesis significance testing.↩
- Note that we do not assume, as is typical, that $\Delta$ is constant across all sample sizes $n$. See Howard et al., “Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences”, p. 19 for a discussion of the implications of assuming that lift is invariant over sample sizes.↩
- In particular, Howard et al. assume unit variance while our lift estimate has variance $\hat{\sigma}^2_{\Delta}$, and we set our bounds around the mean estimated lift, rather than the sum.↩