Skip to main content

Running Non-inferiority Tests

info

Eppo now support Non-inferiority tests with the Guardrail cutoffs.

This How To guide walks you through how to run non-inferiority tests in Eppo. This evaluation allows you to measure that a new treatment is not significantly worse than an existing or standard treatment in terms of effectiveness or safety.

Analysis

For this guide, we assume that the way you run the non-inferiority analysis is by running a one-sided hypothesis test on whether the impact is at worst c-c%, where c>0c>0 is your inferiority tolerance. The closer cc it is to 00, the stricter your test is; basically you need stronger evidence of before you call a test non-harmful.

The left endpoint of the confidence interval in Eppo has the same information as the non-inferiority test at half the significance level (α\alpha).

  • To perform the test in Eppo, visually check that the left side of the confidence interval is higher than your non-inferiority tolerance. If it's higher than the tolerance, then you can call the experiment non-harmful. If it's lower than the tolerance, then you don't have enough data to call it non-harmful.
    • If the right endpoint is lower than 00, then you can say the test is harmful. Note that with a permissive tolerance and high statistical power, both of these may happen at the same time!
    • For metrics where lower is better, flip everything above. You'll compare the right endpoint to a threshold above 0.
  • If want to run your non-inferiority test with α=0.025\alpha=0.025, then Eppo's confidence interval with the default of α=0.05\alpha=0.05 will be what you want. If you are using a one-sided test with α=0.05\alpha=0.05, then you would have to set the α=0.1\alpha=0.1 in Eppo to get the same results.

Example

Example experiment.png

In this example experiment, you might want to do a non-inferiority test on "Total revenue". Let's say you're willing to move forward as long as the impact is no worse than c=5c=-5%. You see that the left side of the confidence interval is 4.40-4.40%, so you can reject the null hypothesis, aka declare that the test caused no harm. If instead you had a stricter threshold of c=3c=-3%, you wouldn't have enough evidence (at that sample size) to make the call that the treatment caused no harm.