Inconclusive tests, where the metrics for your split don’t reach statistical significance, can be a confusing and perhaps disappointing result to face. It means the data doesn’t support your original hypothesis (unless you were running a ‘do-no-harm’ test) and there isn’t much evidence to suggest that the treatment had any impact at all.
However, you shouldn’t be disheartened by inconclusive tests. For one thing, they are very common; successful tech giants such as Google and Bing report that only about 10% to 20% of their experiments generate positive results. They are also very valuable; whilst not having your hypothesis validated might feel disappointing, this is actually one of the main ways experimentation can bring value. An inconclusive result can save you from spending time and resources developing features that aren’t bringing the value you thought they would.
Interpreting inconclusive metrics
A metric is inconclusive when we cannot confidently say whether the treatment had a desired or undesired impact on it. Alongside the measured difference between the treatments, we show the confidence interval around this measured value through the ‘impact lies between...’ text on the metric cards. When a metric is inconclusive, this ‘impact lies between’ range will include 0% impact, since we have not been able to rule out the possibility that the treatment has no impact on your metric.
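As a rough illustration of how such a range is read, the sketch below computes a confidence interval for the relative difference between two treatments and checks whether it includes 0%. It assumes a simple normal approximation and treats the control mean as a fixed, positive baseline. The function names and input numbers are hypothetical, and this is not necessarily how the metric cards calculate their range.

```python
from math import sqrt
from statistics import NormalDist


def relative_impact_interval(mean_c, sd_c, n_c, mean_t, sd_t, n_t, confidence=0.95):
    """Approximate confidence interval for the relative impact of the treatment.

    Assumption: normal approximation, with the control mean treated as a
    fixed positive baseline when converting to a relative change.
    """
    if mean_c <= 0:
        raise ValueError("this sketch assumes a positive control mean")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = mean_t - mean_c
    # Standard error of the difference between two independent sample means.
    se = sqrt(sd_c ** 2 / n_c + sd_t ** 2 / n_t)
    return (diff - z * se) / mean_c, (diff + z * se) / mean_c


def is_inconclusive(low, high):
    """A metric is inconclusive when the interval includes 0% impact."""
    return low <= 0.0 <= high


# Hypothetical numbers: bookings per user in control vs. treatment.
low, high = relative_impact_interval(2.0, 1.0, 10_000, 2.01, 1.0, 10_000)
print(f"impact lies between {low:.2%} and {high:.2%}")
print("inconclusive" if is_inconclusive(low, high) else "statistically significant")
```

With these made-up inputs the interval spans zero, so the metric would be reported as inconclusive even though the measured difference is positive.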
One of the first things to look out for when interpreting statistically inconclusive metrics is the ‘power’ of the metric. Not reaching statistical significance does not mean that your treatment had no impact -- it means that there wasn’t enough evidence to say that there was an impact. This is a subtle but important difference.
Each metric for a given test will only have enough power to detect impacts greater than a given size -- this is sometimes called the metric’s minimum likely detectable effect (MLDE). Getting an inconclusive result means it is unlikely that your test had an impact greater than this MLDE, in either direction, but there could be an impact lower than this value that the test simply wasn’t able to detect. You can find this value by hovering over the question mark on your inconclusive card.
For example, in the image above, the hover text implies that you can be confident your treatment didn’t impact the number of bookings per user by more than a relative change of 2.39%, but that there may be an impact smaller than that value.
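To get a feel for where a detectable-effect figure like this comes from, here is a minimal sketch of the textbook calculation for a two-sided test comparing two equally sized treatments. The significance level, the power, the normal approximation and the function name are all assumptions made for illustration; the value shown on your metric card may be computed differently.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_mean, sd, n_per_treatment, alpha=0.05, power=0.8):
    """Smallest relative impact the test could reliably detect.

    Assumptions: equal sample sizes and equal standard deviations in both
    treatments, two-sided test, normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between the two treatment means.
    se = sd * sqrt(2 / n_per_treatment)
    return (z_alpha + z_power) * se / baseline_mean


# Hypothetical metric: 2.0 bookings per user on average, standard deviation 1.0.
for n in (10_000, 40_000):
    mde = minimum_detectable_effect(2.0, 1.0, n)
    print(f"{n} users per treatment -> detectable relative impact of about {mde:.2%}")
```

Under these assumptions, quadrupling the sample size halves the smallest impact the test can detect.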
If this number is too large, and you want to know about potential impacts smaller than this, then you will need a larger sample size. This may mean running another experiment either for a longer period of time or with a different percentage rollout. You can use the sensitivity calculators on this page to help understand how long you would need to run an experiment to get the required sensitivity.
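If you want a back-of-the-envelope estimate before reaching for the calculators, the sketch below inverts the usual power calculation to estimate how many users each treatment would need, and how many days that would take at a given traffic level. It assumes a two-sided test, a normal approximation and equal traffic to both treatments; the function names and figures are hypothetical and the calculators may use a different method.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size_per_treatment(baseline_mean, sd, target_relative_impact,
                                       alpha=0.05, power=0.8):
    """Users needed in EACH treatment to detect the target relative impact.

    Assumptions: equal sample sizes and standard deviations in both
    treatments, two-sided test, normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    absolute_impact = target_relative_impact * baseline_mean
    return ceil(2 * ((z_alpha + z_power) * sd / absolute_impact) ** 2)


def days_to_run(n_per_treatment, daily_users_per_treatment):
    """Days needed to collect the sample, assuming steady daily traffic."""
    return ceil(n_per_treatment / daily_users_per_treatment)


# Hypothetical: detect a 1% relative change in a metric averaging 2.0 (sd 1.0),
# with 5,000 new users entering each treatment per day.
n = required_sample_size_per_treatment(2.0, 1.0, 0.01)
print(f"about {n} users per treatment, roughly {days_to_run(n, 5_000)} days")
```

Halving the impact you want to detect roughly quadruples the required sample, which is why small effects can demand much longer experiments or a larger percentage rollout.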
Learning from inconclusive results
Inconclusive tests can still provide valuable learning opportunities. It is worth looking into the results to see if certain segments responded differently to your treatment. Whilst this may not provide enough evidence to draw strong conclusions on its own, it can provide insights and ideas for next iterations or further testing.
For example, you may see that although there was no significant impact across all users, there was a notable impact in the desired direction for a subset of your users, such as premium users or users from a particular geographic location. This might suggest it's worthwhile to repeat the test specifically targeting that set of users.
Inconclusive results may also indicate that some of your assumptions about your users are invalid, such as what they want or what they find useful. They may also suggest that your users’ problems, or pain points, are not what you thought they were. These are all valuable lessons that can help inform future hypotheses and tests.
When you are unsure which treatment performed better, it is often best to keep the current default state: the control treatment. If there is no reason to believe making a change will bring any benefit, then sticking with the current state will avoid unnecessary changes to the user experience.
However, if you have your own reasons to favour one treatment over the other, for example because it is cheaper or easier to maintain, then an inconclusive result can give you confidence that making that change will not disadvantage your users, and you can safely go with your preference.
Finally, if you are seeing inconclusive results too often, it may be a sign that you should test bigger, bolder changes. Subtle changes often have subtle impacts, which require very high levels of traffic to detect. Sometimes you need to go big: more dramatic and visible changes to your product are more likely to produce significant results, and controlled experimentation provides a way to safely test your big ideas.