# Statistical nitty-gritty

The fixed-sample, sequential, and Bayesian methods all start from the same
place: estimating the **mean** ($\mu$)
and **variance** ($\sigma^2$) of the lift;
the former is an estimate of the size of the treatment effect (as a percentage
of the baseline metric value), and the latter is an indication of how reliable
that estimate is for predicting what will happen if you ship the treatment.

Once we have an estimate of the observed lift (both its mean and variance), we can use that to construct a confidence interval that describes the plausible values for the true lift.

## Estimating lift

First, some notation:

For a given metric, we observe $y_i$ for each subject $i$ in the control group of size $n_C$, and $y_j$ for each subject $j$ in the treatment group of size $n_T$. The set of all observations in the control group is $\bold{Y}_C$, and for the treatment group it is $\bold{Y}_T$.

The

*true*population mean (which we don't know!) of the metric among the control group is $\mu_C$, and among the treatment group it is $\mu_T$. Similarly, the*true*(unobserved) population variance is $\sigma^2_C$ and $\sigma^2_T$$m_C$ and $m_T$ are the averages of $\bold{Y}_C$ and $\bold{Y}_T$ across all subjects in the control group and treatment group, respectively:

$\begin{equation} \begin{split} m_C &= \frac{\sum_{i=1}^{n_C}{y_i}}{n_C} \\[1.2em] m_T &= \frac{\sum_{j=1}^{n_T}{y_j}}{n_T} \end{split} \end{equation}$$s^2_C$ and $s^2_T$ are the

*sample*variances of $\bold{Y}_C$ and $\bold{Y}_T$ for the control group and treatment group, respectively:$\begin{equation} \begin{split} s^2_C &= \frac{n_C}{n_C - 1}\left(\frac{\sum_{i=1}^{n_C}{y_i^2}}{n_C} - m_C^2\right) \\[1.2em] s^2_T &= \frac{n_T}{n_T - 1}\left(\frac{\sum_{j=1}^{n_T}{y_j^2}}{n_T} - m_T^2\right) \end{split} \end{equation}$

Regardless of the distribution of $\bold{Y}_C$ and $\bold{Y}_T$, the
Central Limit Theorem (CLT)
says that, when $n_C$ and $n_T$ are sufficiently large, the *mean* of
$\bold{Y}_C$ and $\bold{Y}_T$ are each normally distributed;
that is:

Eppo also supports ratio metrics, where instead of estimating the mean of a single metric, we are interested in understanding the ratio of two means (or, equivalently, the ratio of two sums), such as average order value or time per session. In this case (leaving out the $\small{C}$ / $\small{T}$ subscript for brevity), we observe from a sample of size $n$ the numerator metric $y_i$ and the denominator metric $z_i$ for each subject (with the set of all subject observations being $\bold{Y}$ and $\bold{Z}$ respectively) and are trying to estimate:

As above, $\mu_{\tiny{Y}}$ and $\mu_{\tiny{Z}}$ can be estimated by their sample means $m_{\tiny{Y}}$ and $m_{\tiny{Z}}$, which are normally distributed around the true values with variances $\frac{\sigma^2_{\tiny{Y}}}{n}$ and $\frac{\sigma^2_{\tiny{Z}}}{n}$, which in turn can be estimated by the sample variances $s_{\tiny{Y}}$ and $s_{\tiny{Z}}$.

Estimating the ratio $\mu$ then becomes a matter of dividing the distributions
for each component of the ratio; under some reasonable assumptions^{1},
the resulting quotient is also approximately normally distributed:

The variance term can be calculated using the delta method (with $\sigma_{\!\tiny{\textit{YZ}}}$ representing the covariance between $\bold{Y}$ and $\bold{Z}$) as:

Note that the extra $n$ in the denominator of the last term is because we need the covariance between the sample averages.

For each variation, then, we can plug in the sample moments of the numerator and denominator metrics to calculate the values for the ratio:

The below analysis then simply uses these values for ratio metrics instead of the simple sample means and variances for each variation.

### Frequentist analysis

We want to calculate the lift, which we'll call $\Delta$:

But, since we don't know the true values $\mu_T$ and $\mu_C$,
we'll need to instead *estimate* the lift. We know from the CLT, as shown in
equation 3 above, that $m_T$ and $m_C$ are approximately normally
distributed (for sufficiently large $n_C$ and $n_T$); furthermore, since $m_T$
and $m_C$ are independent, under reasonable assumptions the ratio
$\frac{m_T}{m_C}$ is approximately normal.^{2} This allows us to model
$\hat{\Delta}$ as a normal distribution:

Note that the calculation of the variance $\hat{\sigma}^2_{\tiny{\Delta}}$ relies on the delta method.

Thus, we have estimated the lift as being normally distributed with a mean of $\hat{\mu}_{\tiny{\Delta}}$ and a variance of $\hat{\sigma}^2_{\tiny{\Delta}}$. For the frequentist methods (fixed-sample and sequential), that's where we end, in terms of estimating the lift (see below for turning that estimate into confidence intervals). For Bayesian, however, there's one more step.

If you are using CUPED, then the estimate of the lift will be a bit more
complicated; we still model the lift as a normal distribution, but the mean and
variance are computed after using a ridge regression to account for random (that
is, *not correlated with the treatment assignment*) pre-experiment differences
between the groups. See the CUPED docs for more information.

### Bayesian analysis

In a Bayesian framework, you start with a *prior distribution*, which describes
what you believe before running the experiment. Then, you run the experiment and
collect data, which you use to *update* your prior: in essence, you combine your
pre-experiment beliefs about what the lift would be, with the evidence you've
gotten from the experiment, into a *new* set of beliefs, called the *posterior*
(because it comes *after* gathering data). The estimated average lift is then
just the mean of this posterior distribution.

#### Setting the prior

In our implementation of the Bayesian approach, we use a normal
distribution^{3} as our prior for the lift:

In other words, our prior is that the lift, on average, will be zero, and that for each metric, about 50% of experiments will show a lift between -21% and +21%, and about 95% of experiments will show a lift between -62% and +62%; from our experience running experiments, this is a fairly conservative prior, as having lifts over ±50% is extremely rare.

#### Updating the prior

The evidence from the experiment is construed as a normal distribution $\hat{\Delta}$ just as with the frequentist methods (see equations 2–4 above). However, for the Bayesian method we use this evidence to update the above prior, and the result is our posterior.

Specifically, our posterior is a normal distribution with mean $\mu_{post}$ and
variance $\sigma_{post}^2$, where:^{6}

In other words, our posterior mean is the weighted average of the prior and the
observed data, where for each term the weight is the precision (that is, the
*inverse* of the variance). The variance, meanwhile, is related to the harmonic
mean of the variances of the prior and the observed data.

We can also rewrite^{4} the posterior mean $\mu_{post}$
(equation 6) as:

Note that
$\tfrac{\hat{\sigma}^2_{\tiny{\Delta}}}{\hat{\sigma}^2_{\tiny{\Delta}} + \sigma^2_{prior}}$
reflects how spread out the data are, relative to the prior,^{5} and
$(\hat{\mu}_{\tiny{\Delta}} - \mu_{prior})$ is the distance between what we've observed
and our prior expectation; thus, we can interpret equation 8 as showing that our posterior
lift is the lift we observed in the experiment *shrunk toward the prior* (that is, toward 0),
and that the shrinkage will be larger if our data is noisy
(such as happens when we have few observations) and/or our prior is very strong
(that is, $\sigma^2_{prior}$ is low).

#### Arriving at the posterior

Since our prior is that the lift is zero (that is, $\mu_{prior}=0$), we can simplify the statement of our posterior (equations 7 and 8) to:

## Confidence intervals

Now that we have (normal) distributions describing our estimate of the lift, that is, an estimate for the mean and variance of the lift, we can construct confidence intervals. The width of those intervals will depend on the confidence level, $c \in (0, 1)$, which represents the desired likelihood of the interval including the true lift.

### Frequentist analysis

For frequentist methods, we want to ensure that our confidence interval contains
the true lift $\Delta$ with some minimum probability $c$
(our *confidence level*).^{7} Specifically, after $t$ observations
(where $t = n_C + n_T$ and can be thought of as a function of *time*), we
want to set the lower and upper bounds of our confidence interval $(L_t, U_t)$
such that:^{8}

#### Fixed-sample

Fixed-sample analysis assumes that the results are only looked at once, and therefore only a single interval need be constructed—so we can ignore the $t$ subscripts in the above constraint (eq. 15). Given that we have constructed a normal distribution that describes the lift estimate, we can therefore simply use that distribution's quantile function $F_{\tiny{\Delta}}^{-1}(p) = \hat{\mu}_{\tiny{\Delta}} - \hat{\sigma}_{\tiny{\Delta}} \mathop{\Phi^{-1}}(p)$, where $\mathop{\Phi^{-1}}$ is the quantile function for the standard normal distribution and $p \in (0, 1)$ is the desired quantile. Specifically, this gives us lower and upper bounds for a given confidence level $c$ (e.g., $c = 0.95$ for a 95% confidence interval):

#### Sequential

For the sequential analysis method, we need to ensure that the constraint in
equation 15 holds *for all $t$ at once*. That is:

This means that we do not have a single *confidence interval*, but rather a
*confidence sequence*: an infinite sequence of confidence intervals such that,
not only is each individual interval valid for controlling the rate of false
positives, but the aggregation of all intervals is valid as well. In other
words, a confidence sequence provides statistical guarantees while allowing
you to peek at the results as many times as you want (including after every
single observation) and to stop the experiment at any time.

The method we use for constructing the bounds $L_t^c$ and $U_t^c$ comes from Howard et al., “Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences” and has the following useful properties:

- The statistical guarantees are valid even for very broad assumptions about the underlying distribution of $\Delta$.
- You do not need to predeterimine the sample size.
- You can peek at results any number of times, and can decide, based on what you see, to shut down the experiment or keep it running to collect more data.
- You can use any stopping rule you like, meaning, for example, you can change the confidence level in the middle of the experiment.
- As you collect more data, the width of the intervals will tend to get smaller and smaller, and eventually will approach zero.
- Although the confidence intervals are wider than for fixed-sample analysis, the penalty incurred for the additional flexibility and generality is smaller than that from a host of previous methods.

We use a slightly modified version of equation 14 from the reference, with some
changes in notation, to construct our bounds.^{9} Specifically, using
the estimated lift $\hat{\mu}_{\tiny{\Delta}}$ from equation 3, the
estimated standard error of the lift $\hat{\sigma}_{\tiny{\Delta}}$ from equation 4,
the confidence level $c \in (0, 1)$, and the total sample size $t = n_C + n_T$
:

where $N_{tune}$ is a tuning parameter that is used to determine where the ratio between the sequential confidence interval width and the fixed-sample confidence interval width is minimized; we set $N_{tune}=10{,}000$ to try to minimize the cost (in terms of additional sample needed) of the sequential method around typical A/B test sample sizes.

We have run extensive simulations to validate that these confidence intervals satisfy the specified coverage guarantees.

### Bayesian

For simplicity, when referring to the lower and upper bounds around a lift estimate, we generally use the phrase "confidence interval" regardless of which analysis method you might be using; in a Bayesian context, however, the term is a misnomer, as Bayesian methods approach the problem of inference differently than frequentist methods do. Bayesians therefore use credible intervals instead.

Although the epistemological underpinnings for constructing a Bayesian credible interval may share little with the process for constructing a fixed-sample confidence interval, the statistical procedure is nearly identical. That is, given our posterior distribution $\hat{\Delta}_{post}$ from equation 9, we simply look at the quantile function for that distribution to set our bounds:

It is important to note that the constraint in equation 12 (ensuring that the
probability that the true lift falls within the bounds $(L^c, U^c)$ is at least
as high as our confidence level) does *not* hold in the Bayesian case, simply
because such a constraint is nonsensical in a Bayesian context. Instead, the
above bounds in equation 15 (which constitute a *credible interval*) describe
our beliefs about what the lift might plausibly be given our prior and the
observed data. In particular, we can say that we expect $\Delta \in (L^c, U^c)$
with probability $c$, given our prior and the observed data.

- Under reasonable assumptions, we can approximate the ratio of two normal distributions as a normal distribution centered on the ratio of the means. In essence, the approximation requires that the denominator be unlikely to be negative. Since all metrics are positive, the requirement boils down to whether the distribution of the denominator is sufficiently narrow (that is, has sufficiently low variance, relative to the mean). There is a short treatment of this approximation here, and a longer treatment in Díaz-Francés and Rubio (2004), "On the Existence of a Normal Approximation to the Distribution of the Ratio of Two Independent Normal Random Variables."↩
- For more on requirements for this approximation, see note above. In this case, the denominator (which must be unlikely to be negative for the approximation to hold) is the distribution of the treatment metric.↩
- We use a normal distribution because it is a convenient conjugate prior, meaning that we can update it with our (normally distributed) lift estimate and produce another normal distribution. In this case, we are assuming that the variance of the lift is known, that is, that $\hat{\sigma}^2_{\tiny{\Delta}}$ is accurate. The choice of a wide prior is, in part, designed to compensate for this assumption. Evaluating and improving upon our choice of prior is an area of ongoing research.↩
- For a derivation, see Gelman et al., “Bayesian Data Analysis Third Edition” (2020), §2.5. An alternative derivation is provided in Murphy, “Conjugate Bayesian Analysis of the Gaussian Distribution” (2007). ↩
- This section follows Gelman et al., “Bayesian Data Analysis Third Edition” (2020), p. 40, with some slight tweaks to notation and ordering.↩
- In particular, if the sample variance of the data goes to zero (as
would happen if our sample size gets very large), so will this quotient,
meaning that the prior will have less and less of an effect on the
posterior. Furthermore, it will get to zero
*faster*, as we add samples, if our prior is*weaker*—that is, if it has a higher variance. On the other hand, if we have a*strong*prior belief, represented by a*low*value of $\sigma^2_{prior}$, then moving this term toward zero requires*more*data (or, data that doesn't vary much).↩ - This is equivalent to saying that we want to limit the
*false positive rate*to be no more than $1-c$, which is how this constraint is typically framed in the context of null hypothsesis significance testing.↩ - Note that we do not assume, as is typical, that $\Delta$ is constant across all sample sizes $t$. See Howard et al., “Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences”, p. 19 for a discussion of the implications of assuming that lift is invariant over sample sizes.↩
- In particular, Howard et al. assume unit variance while our lift estimate has variance $\hat{\sigma}^2_{\tiny{\Delta}}$, and we set our bounds around the mean estimated lift, rather than the sum. ↩