Statistical nitty-gritty
The fixed-sample, sequential, and Bayesian methods all start from the same place: estimating the mean ($\hat{\Delta}$) and variance ($\hat{\sigma}_\Delta^2$) of the lift; the former is an estimate of the size of the treatment effect (as a percentage of the baseline metric value), and the latter indicates how reliable that estimate is for predicting what will happen if you ship the treatment.
Once we have an estimate of the observed lift (both its mean and variance), we can use that to construct a confidence interval that describes the plausible values for the true lift.
Estimating lift
First, some notation:
- For a given metric, we observe $Y_i^C$ for each subject $i$ in the control group of size $n_C$, and $Y_j^T$ for each subject $j$ in the treatment group of size $n_T$. The set of all observations in the control group is $Y^C$, and for the treatment group it is $Y^T$.
- The true population mean (which we don't know!) of the metric among the control group is $\mu_C$, and among the treatment group it is $\mu_T$. Similarly, the true (unobserved) population variance is $\sigma_C^2$ and $\sigma_T^2$, respectively.
- $\bar{Y}^C$ and $\bar{Y}^T$ are the averages of $Y^C$ and $Y^T$ across all subjects in the control group and treatment group, respectively:

$$\bar{Y}^C = \frac{1}{n_C}\sum_{i=1}^{n_C} Y_i^C, \qquad \bar{Y}^T = \frac{1}{n_T}\sum_{j=1}^{n_T} Y_j^T$$
- $s_C^2$ and $s_T^2$ are the sample variances of $Y^C$ and $Y^T$ for the control group and treatment group, respectively:

$$s_C^2 = \frac{1}{n_C - 1}\sum_{i=1}^{n_C}\left(Y_i^C - \bar{Y}^C\right)^2, \qquad s_T^2 = \frac{1}{n_T - 1}\sum_{j=1}^{n_T}\left(Y_j^T - \bar{Y}^T\right)^2$$
Regardless of the distribution of $Y^C$ and $Y^T$, the Central Limit Theorem (CLT) says that, when $n_C$ and $n_T$ are sufficiently large, the means $\bar{Y}^C$ and $\bar{Y}^T$ are each approximately normally distributed; that is:

$$\bar{Y}^C \sim \mathcal{N}\!\left(\mu_C,\ \frac{\sigma_C^2}{n_C}\right), \qquad \bar{Y}^T \sim \mathcal{N}\!\left(\mu_T,\ \frac{\sigma_T^2}{n_T}\right)$$
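As a concrete illustration, the sample moments and CLT-based standard errors can be computed as below. The data are made up, and the code is a sketch of the formulas above rather than Eppo's implementation:

```python
import statistics

# Hypothetical per-subject metric observations for each group (illustrative data).
control = [4.1, 3.9, 5.0, 4.4, 4.8, 3.7, 4.2, 4.6]
treatment = [4.9, 5.1, 4.3, 5.4, 4.7, 5.2, 4.8, 5.0]

def sample_moments(ys):
    """Return (n, sample mean, sample variance) for one group."""
    n = len(ys)
    # statistics.variance uses the n - 1 denominator, matching the definition above.
    return n, statistics.fmean(ys), statistics.variance(ys)

n_c, ybar_c, s2_c = sample_moments(control)
n_t, ybar_t, s2_t = sample_moments(treatment)

# By the CLT, each sample mean is approximately N(mu, s^2 / n);
# its standard error is sqrt(s^2 / n).
se_c = (s2_c / n_c) ** 0.5
se_t = (s2_t / n_t) ** 0.5
```

The standard errors `se_c` and `se_t` are what shrink as each group collects more observations.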
Eppo also supports ratio metrics, where instead of estimating the mean of a single metric, we are interested in understanding the ratio of two means (or, equivalently, the ratio of two sums), such as average order value or time per session. In this case (leaving out the $C$/$T$ subscript for brevity), we observe from a sample of size $n$ the numerator metric $X_i$ and the denominator metric $Y_i$ for each subject $i$ (with the set of all subject observations being $X$ and $Y$, respectively) and are trying to estimate:

$$\mu = \frac{\mu_X}{\mu_Y}$$
As above, $\mu_X$ and $\mu_Y$ can be estimated by their sample means $\bar{X}$ and $\bar{Y}$, which are approximately normally distributed around the true values with variances $\sigma_X^2/n$ and $\sigma_Y^2/n$, which in turn can be estimated using the sample variances $s_X^2$ and $s_Y^2$.
Estimating the ratio then becomes a matter of dividing the distributions for each component of the ratio; under some reasonable assumptions1, the resulting quotient is also approximately normally distributed:

$$\hat{\mu} = \frac{\bar{X}}{\bar{Y}} \sim \mathcal{N}\!\left(\frac{\mu_X}{\mu_Y},\ \sigma_{\hat{\mu}}^2\right)$$
The variance term $\sigma_{\hat{\mu}}^2$ can be calculated using the delta method (with $\sigma_{XY}$ representing the covariance between $X$ and $Y$) as:

$$\sigma_{\hat{\mu}}^2 = \frac{\mu_X^2}{\mu_Y^2}\left(\frac{\sigma_X^2}{n\,\mu_X^2} + \frac{\sigma_Y^2}{n\,\mu_Y^2} - \frac{2\,\sigma_{XY}}{n\,\mu_X\,\mu_Y}\right)$$
Note that the extra $n$ in the denominator of the last term is because we need the covariance between the sample averages: $\operatorname{Cov}(\bar{X}, \bar{Y}) = \sigma_{XY}/n$.
For each variation, then, we can plug in the sample moments of the numerator and denominator metrics to calculate the values for the ratio:

$$\hat{\mu} = \frac{\bar{X}}{\bar{Y}}, \qquad \hat{\sigma}_{\hat{\mu}}^2 = \frac{\bar{X}^2}{\bar{Y}^2}\left(\frac{s_X^2}{n\,\bar{X}^2} + \frac{s_Y^2}{n\,\bar{Y}^2} - \frac{2\,s_{XY}}{n\,\bar{X}\,\bar{Y}}\right)$$

where $s_{XY}$ is the sample covariance between $X$ and $Y$.
The remaining calculations are the same for both simple and ratio metrics, noting that $\hat{\mu}$ is treated as the sample mean, so it is used in place of $\bar{Y}$ in the lines below. Also, $\hat{\sigma}_{\hat{\mu}}^2$ is used in place of the variance of the sample mean, $s^2/n$, not in place of $s^2$.
Frequentist analysis
We want to calculate the lift, which we'll call $\Delta$:

$$\Delta = \frac{\mu_T - \mu_C}{\mu_C} = \frac{\mu_T}{\mu_C} - 1$$
But, since we don't know the true values $\mu_C$ and $\mu_T$, we'll need to instead estimate the lift. We know from the CLT, as shown above, that $\bar{Y}^C$ and $\bar{Y}^T$ are approximately normally distributed (for sufficiently large $n_C$ and $n_T$); furthermore, since $\bar{Y}^C$ and $\bar{Y}^T$ are independent, under reasonable assumptions the ratio $\bar{Y}^T/\bar{Y}^C$ is approximately normal.2 This allows us to model the estimated lift $\hat{\Delta}$ as a normal distribution:

$$\hat{\Delta} = \frac{\bar{Y}^T}{\bar{Y}^C} - 1 \sim \mathcal{N}\!\left(\Delta,\ \hat{\sigma}_\Delta^2\right), \qquad \hat{\sigma}_\Delta^2 = \frac{1}{\left(\bar{Y}^C\right)^2}\left(\frac{s_T^2}{n_T} + \left(\frac{\bar{Y}^T}{\bar{Y}^C}\right)^2\frac{s_C^2}{n_C}\right)$$
Note that the calculation of the variance relies on the delta method.
Thus, we have estimated the lift as being normally distributed with a mean of $\hat{\Delta}$ and a variance of $\hat{\sigma}_\Delta^2$. For the frequentist methods (fixed-sample and sequential), that's where we end, in terms of estimating the lift (see below for turning that estimate into confidence intervals). For Bayesian, however, there's one more step.
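The lift estimate above can be sketched as follows. This is a minimal illustration of the delta-method formula (it ignores CUPED and is not Eppo's production code):

```python
import statistics

def lift_estimate(control, treatment):
    """Estimate the lift, Delta-hat = Ybar_T / Ybar_C - 1, and its variance.

    The variance follows the delta method for the ratio of two independent
    sample means.
    """
    n_c, n_t = len(control), len(treatment)
    ybar_c, ybar_t = statistics.fmean(control), statistics.fmean(treatment)
    s2_c, s2_t = statistics.variance(control), statistics.variance(treatment)
    delta_hat = ybar_t / ybar_c - 1
    # Delta-method variance of the lift estimate.
    var_delta = (s2_t / n_t + (ybar_t / ybar_c) ** 2 * s2_c / n_c) / ybar_c**2
    return delta_hat, var_delta
```

Both frequentist methods and the Bayesian method consume this pair: the frequentist methods turn it directly into an interval, while the Bayesian method first combines it with a prior.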
If you are using CUPED, then the estimate of the lift will be a bit more complicated; we still model the lift as a normal distribution, but the mean and variance are computed after using a ridge regression to account for random (that is, not correlated with the treatment assignment) pre-experiment differences between the groups. See the CUPED docs for more information.
Bayesian analysis
In a Bayesian framework, you start with a prior distribution, which describes what you believe before running the experiment. Then, you run the experiment and collect data, which you use to update your prior: in essence, you combine your pre-experiment beliefs about what the lift would be, with the evidence you've gotten from the experiment, into a new set of beliefs, called the posterior (because it comes after gathering data). The estimated average lift is then just the mean of this posterior distribution.
Setting the prior
In our implementation of the Bayesian approach, we use a normal distribution3 as our prior for the lift:

$$\Delta \sim \mathcal{N}\!\left(0,\ \sigma_0^2\right)$$

In other words, our prior is that the lift, on average, will be zero, with a standard deviation of $\sigma_0$. You can adjust the prior standard deviation in the Statistical Analysis Plan admin settings to reflect your prior knowledge of how common or rare large lifts are. This setting is shared across all experiments using Bayesian analysis. (To change the prior standard deviation when Bayesian is not the company default, temporarily make Bayesian the default, change the prior standard deviation, and save; then revert to the previous analysis method.)
Updating the prior
The evidence from the experiment is constructed as a normal distribution just as with the frequentist methods (see the estimate of the lift above). However, for the Bayesian method we use this evidence to update the above prior, and the result is our posterior.
Specifically, our posterior is a normal distribution with mean $\mu_{\mathrm{post}}$ and variance $\sigma_{\mathrm{post}}^2$, where:4

$$\mu_{\mathrm{post}} = \frac{\dfrac{\mu_0}{\sigma_0^2} + \dfrac{\hat{\Delta}}{\hat{\sigma}_\Delta^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{1}{\hat{\sigma}_\Delta^2}}, \qquad \sigma_{\mathrm{post}}^2 = \left(\frac{1}{\sigma_0^2} + \frac{1}{\hat{\sigma}_\Delta^2}\right)^{-1}$$

where $\mu_0$ and $\sigma_0^2$ are the mean and variance of the prior.
In other words, our posterior mean is the weighted average of the prior and the observed data, where for each term the weight is the precision (that is, the inverse of the variance). The variance, meanwhile, is related to the harmonic mean of the variances of the prior and the observed data.
We can also rewrite5 the posterior mean as:

$$\mu_{\mathrm{post}} = \mu_0 + \frac{\hat{\Delta} - \mu_0}{1 + \hat{\sigma}_\Delta^2 / \sigma_0^2}$$
Note that $\hat{\sigma}_\Delta^2/\sigma_0^2$ reflects how spread out the data are, relative to the prior,6 and $\hat{\Delta} - \mu_0$ is the distance between what we've observed and our prior expectation; thus, we can interpret this rewritten form as showing that our posterior lift is the lift we observed in the experiment shrunk toward the prior (that is, toward 0), and that the shrinkage will be larger if our data are noisy (such as happens when we have few observations) and/or our prior is very strong (that is, $\sigma_0$ is low).
Arriving at the posterior
Since our prior is that the lift is zero (that is, $\mu_0 = 0$), we can simplify the statement of our posterior to:

$$\mu_{\mathrm{post}} = \frac{\hat{\Delta}}{1 + \hat{\sigma}_\Delta^2/\sigma_0^2}, \qquad \sigma_{\mathrm{post}}^2 = \left(\frac{1}{\sigma_0^2} + \frac{1}{\hat{\sigma}_\Delta^2}\right)^{-1}$$
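The conjugate normal-normal update above can be sketched as follows (an illustrative implementation, with `prior_sd` standing in for the configurable prior standard deviation $\sigma_0$):

```python
def posterior(delta_hat, var_delta, prior_sd):
    """Posterior mean and variance for the lift under a N(0, prior_sd^2) prior.

    delta_hat, var_delta: the observed lift estimate and its variance.
    """
    prior_var = prior_sd**2
    # Precision-weighted combination: posterior precision is the sum of the
    # prior precision and the data precision.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / var_delta)
    # Equivalent shrinkage form: the observed lift shrunk toward the prior
    # mean (0), more strongly when the data are noisy or the prior is tight.
    post_mean = delta_hat / (1.0 + var_delta / prior_var)
    return post_mean, post_var
```

The shrinkage form used for `post_mean` is algebraically identical to the precision-weighted average `post_var * (delta_hat / var_delta)`.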
Confidence intervals
Now that we have (normal) distributions describing our estimate of the lift, that is, an estimate for the mean and variance of the lift, we can construct confidence intervals. The width of those intervals will depend on the confidence level, $1 - \alpha$, which represents the desired likelihood of the interval including the true lift.
Frequentist analysis
For frequentist methods, we want to ensure that our confidence interval contains the true lift with some minimum probability (our confidence level).7 Specifically, after $t$ observations (where the sample sizes $n_C$ and $n_T$ can be thought of as functions of time), we want to set the lower and upper bounds $L_t$ and $U_t$ of our confidence interval such that:8

$$P\left(L_t \le \Delta \le U_t\right) \ge 1 - \alpha$$
Fixed-sample
Fixed-sample analysis assumes that the results are only looked at once, and therefore only a single interval need be constructed—so we can ignore the $t$ subscripts in the above constraint. Given that we have constructed a normal distribution that describes the lift estimate, we can therefore simply use that distribution's quantile function, $\hat{\Delta} + \hat{\sigma}_\Delta\,\Phi^{-1}(q)$, where $\Phi^{-1}$ is the quantile function for the standard normal distribution and $q$ is the desired quantile. Specifically, this gives us lower and upper bounds for a given confidence level $1 - \alpha$ (e.g., $\alpha = 0.05$ for a 95% confidence interval):

$$L = \hat{\Delta} + \hat{\sigma}_\Delta\,\Phi^{-1}\!\left(\frac{\alpha}{2}\right), \qquad U = \hat{\Delta} + \hat{\sigma}_\Delta\,\Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)$$
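A fixed-sample interval is a one-liner given the lift estimate; here is an illustrative sketch using the standard-library normal quantile function:

```python
from statistics import NormalDist

def fixed_sample_ci(delta_hat, var_delta, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for the lift.

    delta_hat, var_delta: the lift estimate and its variance.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile, ~1.96 for alpha=0.05
    se = var_delta**0.5
    return delta_hat - z * se, delta_hat + z * se
```

Because the normal distribution is symmetric, the interval is centered on the estimated lift with half-width $z \cdot \hat{\sigma}_\Delta$.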
Sequential
For the sequential analysis method, we need to ensure that the constraint above holds for all $t$ at once. That is:

$$P\left(L_t \le \Delta \le U_t \ \ \forall\, t\right) \ge 1 - \alpha$$
This means that we do not have a single confidence interval, but rather a confidence sequence: an infinite sequence of confidence intervals such that, not only is each individual interval valid for controlling the rate of false positives, but the aggregation of all intervals is valid as well. In other words, a confidence sequence provides statistical guarantees while allowing you to peek at the results as many times as you want (including after every single observation) and to stop the experiment at any time.
The method we use for constructing the bounds $L_t$ and $U_t$ comes from Howard et al., “Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences” and has the following useful properties:
- The statistical guarantees are valid even under very broad assumptions about the underlying distribution of the metric $Y$.
- You do not need to predetermine the sample size.
- You can peek at results any number of times, and can decide, based on what you see, to shut down the experiment or keep it running to collect more data.
- You can use any stopping rule you like, meaning, for example, you can change the confidence level in the middle of the experiment.
- As you collect more data, the width of the intervals will tend to get smaller and smaller, and eventually will approach zero.
- Although the confidence intervals are wider than for fixed-sample analysis, the penalty incurred for the additional flexibility and generality is smaller than that from a host of previous methods.
We use a slightly modified version of equation 14 from the reference, with some changes in notation, to construct our bounds.9 The bounds are a function of the estimated lift $\hat{\Delta}$, the estimated standard error of the lift $\hat{\sigma}_\Delta$, the confidence level $1 - \alpha$, and the total sample size $N$.
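To make the shape of such a bound concrete, here is a sketch of a two-sided normal-mixture boundary in the spirit of Howard et al. The mixture tuning parameter `rho` and the exact constants here are illustrative assumptions, not Eppo's production formula:

```python
import math

def sequential_ci(delta_hat, se_delta, n_total, alpha=0.05, rho=1.0):
    """Time-uniform confidence interval for the lift at sample size n_total.

    delta_hat: estimated lift; se_delta: its estimated standard error.
    rho tunes the sample size at which the boundary is tightest.
    """
    # Intrinsic time: accumulated variance of the underlying sum process.
    v = n_total**2 * se_delta**2
    # Normal-mixture boundary, scaled back from the sum to the mean.
    radius = math.sqrt((v + rho) * math.log((v + rho) / (rho * alpha**2))) / n_total
    return delta_hat - radius, delta_hat + radius
```

Note that the radius here exceeds the fixed-sample half-width $z_{1-\alpha/2}\,\hat{\sigma}_\Delta$; that extra width is the price of being able to peek at any time.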