Once you have started to collect data on some experiments, you'll want to start reviewing the results! Eppo allows you to see experiment results across multiple experiments, zoom in on a specific experiment, estimate global impact, and then slice and dice those results by looking at different segments and metric cuts.
Viewing multiple experiments
When you click on the Experiments tab, you will see the experiment list view, which shows all of your experiments. You can filter this list by experiment name, status, entity, or owner, or just show experiments you have starred.
Overview of an experiment's results
Clicking on the name of an experiment will take you to the experiment detail view, which shows the effects of each treatment variation, compared to control. Within each variation, for each metric that you have added to the experiment, we display the (per subject) average value for the control variation, as well as the estimate of the relative lift (that is, the percentage change from the control value) caused by that treatment variation.
If you hover over the lift, you can see the metric values for both control and treatment variations, as well as the sum of the underlying fact(s) and the number of subjects assigned to each variation.
Depending on the metric settings, we may remove outliers from the raw data in order to improve the quality of our lift estimates, and so the average and total values displayed in this popover might differ from those displayed in other tools.
In addition, in many cases (in particular, if CUPED is enabled), we perform additional processing on the data before estimating the lift (such as correcting for imbalances across variations along different dimensions). For this reason, the lift displayed on the details page may not match the lift calculated directly from the numbers in this popover.
See Basics of estimating lift for more information on how we estimate lift in different circumstances.
To the right of the details page, we display the estimated lift graphically as a black vertical bar, as well as a confidence interval that shows the values of the lift that we consider plausible given the observed data from the experiment. The precise definition of plausible is determined by, among other things, the confidence level, which defaults to 95%: we set the lower and upper bounds of the confidence interval such that it will contain the true lift at least 95% (or whatever confidence level you've selected) of the time.1 In addition, Eppo has several different methods for calculating the confidence intervals, which can be set at the company or experiment level.
Randomly allocating users to each variation means that the lift we observe in the experiment data should be a good estimate2 of the lift you would observe if you shipped the treatment. However, any time you estimate something for a whole population using measurements from just a portion of that population (in this case, the subjects in the treatment variation), there's a risk that the sample you chose to observe behaves differently from the population as a whole. Specifically, there's always a chance that a bunch of really active subjects, instead of being evenly split across variations, happened to end up concentrated in the control variation or the treatment variation; that would mean that the lift you observed in the experiment would be lower or higher, respectively, than what you'd observe if you shipped the treatment. This is the uncertainty represented by the confidence interval.3 (This is why CUPED, which corrects some of this imbalance, can often greatly reduce the width of the confidence intervals.)
If you hover over the confidence interval, you will see precise values for the upper and lower bounds for this confidence interval:
If the confidence interval is entirely above or below zero, it means that the data is consistent with the treatment moving the metric up or down, respectively. In this case, the confidence interval bar itself, as well as the lift estimate, will be highlighted green if the movement is good (in which case there will also be a 🎉 symbol in the lift estimate box), and red if the movement is bad.
In general, a positive lift will be good (colored green) and a negative lift will be bad (colored red). However, for metrics such as page load time or app crashes, a higher number is bad. We call these reversed metrics. If you've set the "Desired Change" field in the fact definition to "Metric Decreasing", then positive lifts will be in red and negative lifts will be in green.
The confidence level is set as part of the analysis plan, and you can change the company-wide default on the Admin tab or set an experiment-specific confidence level on the experiment Set Up page. The confidence level being used for any experiment is displayed on the experiment detail page below the table of metric results:
The main Confidence intervals tab () displays the experiment results observed for subjects in the experiment, but you may want to understand the treatment effect globally; a large lift in an experiment that targets a tiny portion of your users might have a negligible business impact.
You can click on the Impact accounting icon () to show, for each metric, the coverage (the share of all events that are part of the experiment) and global lift (the expected increase in the metric if the treatment variation were rolled out to 100%, as a percentage of the global total metric value).
In general, the confidence intervals will provide the information you need to make a ship/no-ship decision. However, if you want to see additional statistical details, you can click on the Statistical details icon () to display them. The actual values being shown will differ based on which analysis method is being used: sequential and fixed-sample methods will show the frequentist statistics, while Bayesian statistics will differ to reflect the different decision-making processes compatible with that method.
Frequentist analysis methods (that is, sequential and fixed-sample) will show the following three statistics:
Standard error: The standard error of the lift, expressed in percentage points.
p-value: The p-value represents the likelihood that an A/A test (that is, an experiment where the treatment is identical to the control) would produce a lift of a magnitude (that is, ignoring the sign) at least as large as the one observed in the data. The p-value directly corresponds to the bounds of the confidence interval: if your confidence level is 95%, for example, a p-value less than 0.05 (that is, ) means that the confidence interval around the lift will just exclude zero.
Z-score: Also called the standard score, the Z-score represents how far away the observed lift is from zero, measured as a number of standard deviations. The Z-score is simply a transformation of the p-value: it provides no additional information, but it is sometimes easier to use, particularly when the p-value is very small.
Bayesian methods rely on a different way of thinking about probabilities, and thus use different statistics to summarize the results of an experiment.
Probability Beats Control: The probability that the treatment variation is superior to the control variation; in other words, the chance that the lift is good (positive for most metrics, but negative for reversed metrics)
Probability Better Than MDE: Bayesian analyses comparing two distributions (in our case the average metric values in the treatment variation compared to the average metric values in the control variation) sometimes refer to the Region of Practical Equivalence (ROPE), which is the amount of difference between the distributions that is, practically speaking, trivial. In other words, we might care not just about whether the treatment is strictly better than control, but whether it is better enough that the difference is meaningful (in business terms). Since the MDE represents the smallest lift that the business cares about, it is also a useful boundary for the ROPE.
For example, if the MDE for a metric was a 5% lift, this statistic would calculate the portion of the posterior lift distribution that is above 5% (assuming positive lift is good).
Risk: The expected metric loss (measured in lift terms, that is, as a percentage of the control value) if the lift were in fact negative (or, for reversed metrics, positive). In other words, even if the bulk of the distribution indicates that the treatment is better than control, some portion is going to be on the other side: there's some chance that in fact control is better than treatment. The risk measures, in that case, what the expected value (that is, the average of all lifts weighted by their likelihood under the posterior distribution) of the lift would be.
Segments and filters
You are able to filter experiment results to specific subgroups by selecting the filter menu at the top right corner. We provide two distinct options to filter results.
Segments are pre-specified subsets of users based on multiple attributes. For example, you might have a "North America mobile users", created by filtering for Country = 'Canada', 'USA', or 'Mexico', and Device = 'Mobile'. You can then attach such a segment to any experiment and Eppo pre-computes experiment results for such a subgroup.
Note that when you first add a segment to an experiment, you have to manually refresh the results to compute the results for the segment.
Single dimension filter
For quick investigations, we also provide the single dimension filter: here you can select a single dimension (e.g. Country) and single value (e.g. 'USA'). These results are available immediately -- no need to manually refresh the results.
You can also further investigate the performance of an individual metric by clicking on navigator icon the next to the metric name. This will take you to the Metric explore page where you can further slice the experiment results by different dimensions that have been configured, for example user persona, or browser, etc.
- For more on how we define confidence intervals, how to interpret them, and a precise technical definition of what it means to be "X% confident", see Lift estimates and confidence intervals.↩
- In technical terms, it is consistent, meaning that it gets closer to the true lift as the number of subjects in the experiment increases.↩
- Note that there are other sources of uncertainty that are not included in how the confidence interval is calculated. For example, if you start your experiment in June, but don't ship it until December, the behavior of users after being exposed to the treatment in production may be very different from the behavior observed during the experiment: there may be seasonal differences in how users react to the treatment, perhaps; there may be external circumstances (like trends, or different economic conditions) that have changed and affect how users react to the treatment; or (especially if your product is growing and adding more and more users) the users that use the product in December might be different than those that were part of the experiment in June.↩