Statistical nitty-gritty
The fixed-sample, sequential, and Bayesian methods all start from the same place: estimating the mean ($\mu_\Delta$) and variance ($\sigma^2_\Delta$) of the lift. The former is an estimate of the size of the treatment effect (as a percentage of the baseline metric value), and the latter is an indication of how reliable that estimate is for predicting what will happen if you ship the treatment.
Once we have an estimate of the observed lift (both its mean and variance), we can use that to construct a confidence interval that describes the plausible values for the true lift.
Estimating lift
First, some notation:
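- For a given metric, we observe $Y_{C,i}$ for each subject $i$ in the control group of size $n_C$, and $Y_{T,j}$ for each subject $j$ in the treatment group of size $n_T$. The set of all observations in the control group is $Y_C$, and for the treatment group it is $Y_T$.
- The true population mean (which we don't know!) of the metric among the control group is $\mu_C$, and among the treatment group it is $\mu_T$. Similarly, the true (unobserved) population variance is $\sigma^2_C$ and $\sigma^2_T$.
- $\bar{Y}_C$ and $\bar{Y}_T$ are the averages of $Y_C$ and $Y_T$ across all subjects in the control group and treatment group, respectively:

  $$\bar{Y}_C = \frac{1}{n_C} \sum_{i=1}^{n_C} Y_{C,i}, \qquad \bar{Y}_T = \frac{1}{n_T} \sum_{j=1}^{n_T} Y_{T,j} \tag{1}$$

- $s^2_C$ and $s^2_T$ are the sample variances of $Y_C$ and $Y_T$ for the control group and treatment group, respectively:

  $$s^2_C = \frac{1}{n_C - 1} \sum_{i=1}^{n_C} \left(Y_{C,i} - \bar{Y}_C\right)^2, \qquad s^2_T = \frac{1}{n_T - 1} \sum_{j=1}^{n_T} \left(Y_{T,j} - \bar{Y}_T\right)^2 \tag{2}$$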
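Regardless of the distribution of $Y_C$ and $Y_T$, the Central Limit Theorem (CLT) says that, when $n_C$ and $n_T$ are sufficiently large, the sample means $\bar{Y}_C$ and $\bar{Y}_T$ are each approximately normally distributed; that is:

$$\bar{Y}_C \sim \mathcal{N}\!\left(\mu_C, \frac{\sigma^2_C}{n_C}\right), \qquad \bar{Y}_T \sim \mathcal{N}\!\left(\mu_T, \frac{\sigma^2_T}{n_T}\right) \tag{3}$$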
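As a minimal sketch of the quantities defined above, the snippet below computes the sample mean, the sample variance, and the estimated variance of the sample mean ($s^2/n$) for one group. The function name `sample_moments` and the toy data are illustrative assumptions, not Eppo code, and five observations are far too few for the CLT approximation to be trusted in practice.

```python
from statistics import fmean, variance


def sample_moments(values):
    """Return (sample mean, sample variance, estimated variance of the sample mean).

    Hypothetical helper for illustration only.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to estimate a variance")
    mean = fmean(values)
    var = variance(values)  # unbiased sample variance (n - 1 in the denominator)
    return mean, var, var / n


# Toy per-subject metric values (made up for this example).
control = [10.0, 12.0, 9.0, 11.0, 13.0]
treatment = [12.0, 14.0, 11.0, 13.0, 15.0]

mean_c, var_c, var_mean_c = sample_moments(control)
mean_t, var_t, var_mean_t = sample_moments(treatment)
print(mean_c, var_c, var_mean_c)
print(mean_t, var_t, var_mean_t)
```

The third return value is the plug-in estimate of $\sigma^2/n$, the variance of the sample mean under the CLT.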
Eppo also supports ratio metrics, where instead of estimating the mean of a single metric, we are interested in understanding the ratio of two means (or, equivalently, the ratio of two sums), such as average order value or time per session. In this case (leaving out the $C$/$T$ subscript for brevity), we observe from a sample of size $n$ the numerator metric $N_i$ and the denominator metric $D_i$ for each subject $i$ (with the set of all subject observations being $N$ and $D$, respectively) and are trying to estimate:

$$R = \frac{\mu_N}{\mu_D}$$
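As above, $\mu_N$ and $\mu_D$ can be estimated by their sample means $\bar{N}$ and $\bar{D}$, which are approximately normally distributed around the true values with variances $\sigma^2_N / n$ and $\sigma^2_D / n$. These in turn can be estimated by plugging in the sample variances $s^2_N$ and $s^2_D$, giving $s^2_N / n$ and $s^2_D / n$.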
Estimating the ratio then becomes a matter of dividing the distributions for each component of the ratio; under some reasonable assumptions [1], the resulting quotient is also approximately normally distributed:

$$\frac{\bar{N}}{\bar{D}} \sim \mathcal{N}\!\left(\frac{\mu_N}{\mu_D}, \sigma^2_R\right)$$
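The variance term $\sigma^2_R$ can be calculated using the delta method (with $\sigma_{ND}$ representing the covariance between $N$ and $D$) as:

$$\sigma^2_R = \frac{\sigma^2_N / n}{\mu_D^2} + \frac{\mu_N^2 \, \sigma^2_D / n}{\mu_D^4} - 2 \, \frac{\mu_N \, \sigma_{ND}}{n \, \mu_D^3}$$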
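Note that the extra $n$ in the denominator of the last term is there because we need the covariance between the sample averages, $\operatorname{Cov}(\bar{N}, \bar{D}) = \sigma_{ND} / n$, rather than the per-subject covariance $\sigma_{ND}$.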
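For each variation, then, we can plug in the sample moments of the numerator and denominator metrics (with $s_{ND}$ denoting the sample covariance between $N$ and $D$) to calculate the values for the ratio:

$$\bar{R} = \frac{\bar{N}}{\bar{D}}, \qquad s^2_R = \frac{s^2_N}{n \, \bar{D}^2} + \frac{\bar{N}^2 \, s^2_D}{n \, \bar{D}^4} - 2 \, \frac{\bar{N} \, s_{ND}}{n \, \bar{D}^3}$$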
The remaining calculations are the same for both simple and ratio metrics, noting that $\bar{R}$ is treated as the sample mean, so it is used in place of $\bar{Y}$ in the lines below. Also, $s^2_R$ is used in place of the variance of the sample mean, $s^2 / n$, not in place of $s^2$.
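The delta-method plug-in for a ratio metric can be sketched as follows. This is an illustrative reading of the formulas above, not Eppo's implementation: `ratio_moments` is a hypothetical helper, and the revenue and order figures are made up.

```python
from statistics import fmean, variance


def ratio_moments(numerators, denominators):
    """Return (R_bar, s_R^2) for one variation of a ratio metric.

    s_R^2 is the delta-method estimate of the variance of R_bar itself, so it
    plays the role of s^2 / n (not s^2) in later calculations.
    Hypothetical helper for illustration only.
    """
    n = len(numerators)
    if n != len(denominators) or n < 2:
        raise ValueError("need paired observations for at least two subjects")

    n_bar = fmean(numerators)
    d_bar = fmean(denominators)
    if d_bar == 0:
        raise ValueError("denominator mean must be non-zero")

    s_n2 = variance(numerators)
    s_d2 = variance(denominators)
    # Sample covariance between the per-subject numerator and denominator.
    s_nd = sum(
        (x - n_bar) * (y - d_bar) for x, y in zip(numerators, denominators)
    ) / (n - 1)

    r_bar = n_bar / d_bar
    var_r = (
        s_n2 / (n * d_bar**2)
        + n_bar**2 * s_d2 / (n * d_bar**4)
        - 2 * n_bar * s_nd / (n * d_bar**3)
    )
    return r_bar, var_r


# Made-up example: revenue (numerator) and order count (denominator) per subject.
revenue = [20.0, 0.0, 35.0, 50.0, 15.0]
orders = [1.0, 0.0, 2.0, 2.0, 1.0]
print(ratio_moments(revenue, orders))
```

One sanity check on the formula: when the numerator is exactly proportional to the denominator for every subject, the three terms cancel and the estimated variance of the ratio is zero.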
Frequentist analysis
We want to calculate the lift, which we'll call $\Delta$:

$$\Delta = \frac{\mu_T - \mu_C}{\mu_C} = \frac{\mu_T}{\mu_C} - 1$$
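But, since we don't know the true values $\mu_T$ and $\mu_C$, we'll need to instead estimate the lift. We know from the CLT, as shown in equation 3 above, that $\bar{Y}_T$ and $\bar{Y}_C$ are approximately normally distributed (for sufficiently large $n_T$ and $n_C$); furthermore, since $\bar{Y}_T$ and $\bar{Y}_C$ are independent, under reasonable assumptions the ratio $\bar{Y}_T / \bar{Y}_C$ is approximately normal [2]. This allows us to model $\Delta$ as a normal distribution:

$$\Delta \sim \mathcal{N}\!\left(\mu_\Delta, \sigma^2_\Delta\right), \qquad \mu_\Delta = \frac{\bar{Y}_T}{\bar{Y}_C} - 1, \qquad \sigma^2_\Delta = \frac{s^2_T / n_T}{\bar{Y}_C^2} + \frac{\bar{Y}_T^2 \, s^2_C / n_C}{\bar{Y}_C^4}$$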
Note that the calculation of the variance relies on the delta method.
Thus, we have estimated the lift as being normally distributed with a mean of $\mu_\Delta$ and a variance of $\sigma^2_\Delta$.
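To make the lift estimate concrete, here is a small sketch that computes the mean and delta-method variance of the lift from each group's sample mean and the estimated variance of that mean ($s^2/n$ for simple metrics, or $s^2_R$ for ratio metrics). The function names and input numbers are assumptions for illustration. The interval shown is a plain two-sided normal interval, which is only one way of turning the mean and variance into a confidence interval and may differ from what each analysis method actually reports.

```python
from math import sqrt
from statistics import NormalDist


def lift_estimate(mean_c, var_mean_c, mean_t, var_mean_t):
    """Return (mean, variance) of the lift via the delta method.

    var_mean_c and var_mean_t are variances of the sample means (s^2 / n),
    not per-subject variances. Hypothetical helper for illustration only.
    """
    if mean_c == 0:
        raise ValueError("control mean must be non-zero to compute a relative lift")
    mu = mean_t / mean_c - 1
    var = var_mean_t / mean_c**2 + mean_t**2 * var_mean_c / mean_c**4
    return mu, var


def normal_interval(mu, var, confidence=0.95):
    """Two-sided normal interval around mu (an assumed, fixed-sample style interval)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(var)
    return mu - half_width, mu + half_width


# Made-up summary statistics for control and treatment.
mu_lift, var_lift = lift_estimate(
    mean_c=10.0, var_mean_c=0.04, mean_t=11.0, var_mean_t=0.05
)
low, high = normal_interval(mu_lift, var_lift)
print(f"lift = {mu_lift:.3f}, variance = {var_lift:.6f}, interval = ({low:.3f}, {high:.3f})")
```

Passing the variance of the sample mean rather than the raw sample variance keeps the same function usable for both simple and ratio metrics.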