One common question we get from experimentation customers is: "My metric showed statistical significance yesterday, and now it doesn't. How can that be?"
When you use a fixed-horizon method to calculate your metrics, it is possible for a metric to change from conclusive to inconclusive. You were so excited on Tuesday to see that your metric had a statistically significant uplift of almost 75%, only to have your hopes dashed on Wednesday when the latest recalculation of that metric showed a smaller, statistically inconclusive increase over the baseline.
A metric in Split is classified as statistically significant if the calculated p-value is less than or equal to the statistical significance setting for your organization (defaults to 0.05).
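To make the classification rule concrete, here is a minimal sketch of how a p-value might be computed and compared against that threshold. This is not Split's internal implementation; it assumes a simple pooled two-proportion z-test, and the conversion counts are made up for illustration.

```python
# Hedged sketch: classify a metric as statistically significant by comparing
# a two-proportion z-test p-value against the organization's threshold.
# The conversion counts below are hypothetical.
from math import sqrt, erf

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

ALPHA = 0.05  # organization-level statistical significance setting
p = two_proportion_p_value(conv_a=200, n_a=1000, conv_b=260, n_b=1000)
print(f"p-value = {p:.4f}, significant = {p <= ALPHA}")
```

If your organization has changed the default setting, `ALPHA` would change accordingly; the classification logic stays the same.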
New Data = New Information
Put simply, a metric shifts from statistically significant to inconclusive because data that arrived in the intervening period added to the total picture, and the calculations comparing the treatment to the baseline shifted to inconclusive (p-value > 0.05). This may be because of seasonality (users behave differently on Wednesdays than they do on Tuesdays), or simply because the new treatment was not influential enough to change the behavior of a new batch of users in the same way as the previous ones. Finally, it's possible that the first calculation was a false positive, and that additional data representative of the true effect of the treatment corrected that mistake.
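The Tuesday-to-Wednesday flip described above can be reproduced with a toy calculation. The counts here are invented, and the pooled two-proportion z-test is an assumption for illustration, but the pattern is the point: the cumulative result is significant after Tuesday and inconclusive once Wednesday's users are folded in.

```python
# Hedged illustration (made-up counts): the same metric, recalculated a day
# later with more data, can flip from significant to inconclusive.
from math import sqrt, erf

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Cumulative data through Tuesday: significant (p <= 0.05)
p_tue = p_value(conv_a=40, n_a=500, conv_b=60, n_b=500)
# Wednesday's new users respond less strongly; the cumulative
# result is now inconclusive (p > 0.05)
p_wed = p_value(conv_a=85, n_a=1000, conv_b=108, n_b=1000)
print(f"Tuesday p = {p_tue:.3f}, Wednesday p = {p_wed:.3f}")
```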
One thing to note about the above example is that the confidence interval for the metric, when statistically significant, is rather wide (+4.20% to +144.42%). This suggests that, although a significant p-value was calculated, the data is still somewhat noisy and uncertain.
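A wide interval like that tends to appear when the baseline rate is small and sample sizes are modest. The following is a naive sketch of how such a relative-uplift interval can arise, using a normal approximation on the difference of two conversion rates divided by the baseline rate; this is a simplification, not Split's method, and the counts are hypothetical.

```python
# Hedged sketch: a rough 95% confidence interval for relative uplift.
# Normal approximation on the rate difference, expressed relative to the
# baseline rate. Counts are made up; small baselines yield wide intervals.
from math import sqrt

def uplift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    lo, hi = diff - z * se, diff + z * se
    # Express bounds relative to the baseline rate, as percentages
    return 100 * lo / p_a, 100 * hi / p_a

lo, hi = uplift_ci(conv_a=20, n_a=1000, conv_b=35, n_b=1000)
print(f"relative uplift 95% CI: {lo:+.2f}% to {hi:+.2f}%")
```

Note how the lower bound barely clears zero while the upper bound is enormous: significant, but far from settled.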
Bottom Line: No Peeking!
The possibility of early noisy data and false positives is a key reason it is important to decide how long your experiment will run before starting it, and not to pick a winner sooner than that just because a metric reached significance. In addition to multiple decision points increasing the chance of seeing a false positive, more data gives you more confidence in the durability of the effect you are seeing. Before starting your experiment, use the sample size and sensitivity calculators to see how many users have to encounter your experiment in order to detect a meaningful swing in metrics. These calculators also take into account your seasonality cycle (typically seven days, or one week), so that your experiment's duration covers an equal number of each phase of the cycle.
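For intuition about what such a calculator does under the hood, here is a sketch of the standard two-proportion sample-size formula. The baseline conversion rate, minimum detectable effect, and power level below are assumptions, and actual calculators may use more refined methods.

```python
# Hedged sketch: users needed per treatment arm to detect a given relative
# lift, via the standard two-proportion sample-size formula.
# Baseline rate, detectable effect, and power are assumed inputs.
from math import ceil

Z_ALPHA = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84   # power = 0.80

def sample_size_per_arm(p_base, mde_rel):
    """Users per arm to detect a relative lift of mde_rel at ~80% power."""
    p_new = p_base * (1 + mde_rel)
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((Z_ALPHA + Z_BETA) ** 2 * var / (p_new - p_base) ** 2)

n = sample_size_per_arm(p_base=0.05, mde_rel=0.10)  # detect a 10% relative lift
print(f"{n} users per arm")
```

With the target number of users in hand, you would then round the planned duration up to whole weeks so the run covers complete seasonality cycles.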
If you are using the sequential testing method, you can peek at your results. With these types of tests, your results are always valid, and they won't change once they reach significance (unless there is strong seasonality). Once you turn this feature on, you can check your results as many times as you want, as soon as you start your measurements.
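To see why peeking is safe here, consider one common approach to sequential testing, the mixture sequential probability ratio test (mSPRT), which produces an "always-valid" p-value that can only shrink as data arrives. This sketch is illustrative, not Split's implementation; the observation stream and the `sigma` and `tau` parameters are assumptions.

```python
# Hedged sketch: an mSPRT-style always-valid p-value for a stream of
# per-user metric differences. Because each new p-value is the minimum of
# the old one and 1/likelihood-ratio, it never increases, so a result
# that reaches significance stays significant. sigma/tau are assumed.
from math import sqrt, exp

def msprt_p_values(diffs, sigma=1.0, tau=1.0):
    """Running always-valid p-values for a stream of observed differences."""
    p, total, n, out = 1.0, 0.0, 0, []
    v = sigma ** 2
    for d in diffs:
        total += d
        n += 1
        mean = total / n
        # Mixture likelihood ratio against the null of zero difference
        lam = sqrt(v / (v + n * tau ** 2)) * exp(
            (n ** 2 * tau ** 2 * mean ** 2) / (2 * v * (v + n * tau ** 2))
        )
        p = min(p, 1.0 / lam)  # monotone: the p-value never increases
        out.append(p)
    return out

ps = msprt_p_values([0.9, 1.1, 0.8, 1.2, 1.0, 0.95])
print([round(p, 3) for p in ps])  # non-increasing after every observation
```

Because the running p-value is monotonically non-increasing, checking it after every user costs you nothing in validity, which is exactly what makes peeking harmless with sequential tests.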