Statistical nitty-gritty
The fixed-sample, sequential, and Bayesian methods all start from the same place: estimating the mean and variance of the lift. The mean is an estimate of the size of the treatment effect (as a percentage of the baseline metric value), and the variance indicates how reliable that estimate is for predicting what will happen if you ship the treatment.
Once we have an estimate of the observed lift (both its mean and variance), we can use that to construct a confidence interval that describes the plausible values for the true lift.
Estimating lift
First, some notation:
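- For a given metric, we observe $Y_{C,i}$ for each subject $i$ in the control group of size $n_C$, and $Y_{T,i}$ for each subject $i$ in the treatment group of size $n_T$. The set of all observations in the control group is $Y_C$, and for the treatment group it is $Y_T$.
- The true population mean (which we don't know!) of the metric among the control group is $\mu_C$, and among the treatment group it is $\mu_T$. Similarly, the true (unobserved) population variances are $\sigma_C^2$ and $\sigma_T^2$.
- $\bar{Y}_C$ and $\bar{Y}_T$ are the averages of $Y_C$ and $Y_T$ across all subjects in the control group and treatment group, respectively:

  $$
  \bar{Y}_C = \frac{1}{n_C} \sum_{i=1}^{n_C} Y_{C,i}, \qquad \bar{Y}_T = \frac{1}{n_T} \sum_{i=1}^{n_T} Y_{T,i}
  $$

- $s_C^2$ and $s_T^2$ are the sample variances of $Y_C$ and $Y_T$ for the control group and treatment group, respectively:

  $$
  s_C^2 = \frac{1}{n_C - 1} \sum_{i=1}^{n_C} \left( Y_{C,i} - \bar{Y}_C \right)^2, \qquad s_T^2 = \frac{1}{n_T - 1} \sum_{i=1}^{n_T} \left( Y_{T,i} - \bar{Y}_T \right)^2
  $$

Regardless of the distribution of $Y_C$ and $Y_T$, the Central Limit Theorem (CLT) says that, when $n_C$ and $n_T$ are sufficiently large, the sample means $\bar{Y}_C$ and $\bar{Y}_T$ are each approximately normally distributed; that is:

$$
\bar{Y}_C \sim N\!\left( \mu_C, \frac{\sigma_C^2}{n_C} \right), \qquad \bar{Y}_T \sim N\!\left( \mu_T, \frac{\sigma_T^2}{n_T} \right)
$$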
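To make these quantities concrete, here is a minimal Python sketch that computes the sample mean, the sample variance, and the CLT-based variance of the sample mean for each group. The data, the `summarize` helper, and the variable names are hypothetical and only for illustration. The sketch assumes the usual `n - 1` denominator for the sample variance and plugs the sample variance in for the unknown population variance; it is not a description of Eppo's implementation.

```python
import statistics


def summarize(values):
    """Return (sample mean, sample variance, estimated variance of the sample mean).

    Hypothetical helper for illustration only.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # statistics.variance uses the n - 1 denominator (sample variance).
    sample_var = statistics.variance(values)
    # Under the CLT the sample mean has variance sigma^2 / n; the unknown
    # sigma^2 is approximated here by the sample variance.
    var_of_mean = sample_var / n
    return mean, sample_var, var_of_mean


# Made-up observations of one metric, one value per subject.
control = [10.0, 12.0, 9.0, 11.0, 13.0]
treatment = [12.0, 14.0, 11.0, 13.0, 15.0]

control_mean, control_var, control_var_of_mean = summarize(control)
treatment_mean, treatment_var, treatment_var_of_mean = summarize(treatment)

print(control_mean, control_var, control_var_of_mean)
print(treatment_mean, treatment_var, treatment_var_of_mean)
```

With real experiment data the groups would be far larger; five observations per group is much too small for the normal approximation to be reliable and is used here only to keep the arithmetic easy to follow.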
Eppo also supports ratio metrics, where instead of estimating the mean of a single metric, we are interested in understanding the ratio of two means (or, equivalently, the ratio of two sums), such as average order value or time per session. In this case (leaving out the $C$/$T$ subscript for brevity), we observe from a sample of size $n$ the numerator metric $N_i$ and the denominator metric $D_i$ for each subject $i$ (with the set of all subject observations being $N$ and $D$ respectively) and are trying to estimate:

$$
\frac{\mu_N}{\mu_D}
$$

As above, $\mu_N$ and $\mu_D$ can be estimated by their sample means $\bar{N}$ and $\bar{D}$, which are normally distributed around the true values with variances $\sigma_N^2 / n$ and $\sigma_D^2 / n$, respectively.
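As a rough illustration of the ratio-metric setup, the sketch below computes the ratio of the two sample means (which equals the ratio of the two sums) together with the estimated variance of each sample mean. The `ratio_metric_summary` helper and the revenue and order data are hypothetical, and the sample variance (with an `n - 1` denominator) again stands in for the unknown population variance. The sketch stops at the quantities described above and does not compute the variance of the ratio itself.

```python
import statistics


def ratio_metric_summary(numerator, denominator):
    """Summarize a ratio metric from per-subject numerator and denominator values.

    Hypothetical helper for illustration only.
    """
    if len(numerator) != len(denominator):
        raise ValueError("numerator and denominator need one value per subject")
    n = len(numerator)
    mean_num = statistics.fmean(numerator)
    mean_den = statistics.fmean(denominator)
    return {
        # Ratio of means; identical to sum(numerator) / sum(denominator).
        "ratio": mean_num / mean_den,
        # Estimated variances of the two sample means (sample variance / n).
        "var_mean_numerator": statistics.variance(numerator) / n,
        "var_mean_denominator": statistics.variance(denominator) / n,
    }


# Made-up data: revenue and number of orders for four subjects.
revenue = [20.0, 0.0, 50.0, 30.0]
orders = [1.0, 0.0, 2.0, 1.0]

summary = ratio_metric_summary(revenue, orders)
print(summary["ratio"])                 # average order value
print(sum(revenue) / sum(orders))       # same value, computed from the sums
print(summary["var_mean_numerator"], summary["var_mean_denominator"])
```

Note that subjects with a zero denominator (no orders) still count toward both means, which is why the ratio of means is used rather than an average of per-subject ratios.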