One common question we get from experimentation customers is: "My metric showed statistical significance yesterday, and now it doesn't. How can that be?"
When you use a fixed-horizon method to calculate your metrics, it is possible for a metric to change from conclusive to inconclusive. You were so excited on Tuesday to see that your metric had a statistically significant uplift of almost 75%, only to have your hopes dashed on Wednesday when the latest recalculation of that metric showed a smaller, statistically inconclusive increase over the baseline.
A metric in Split is classified as statistically significant if the calculated p-value is less than or equal to the statistical significance setting for your organization (defaults to 0.05).
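To make the classification rule concrete, here is a minimal sketch of how a p-value might be computed and compared against that threshold. This is not Split's internal implementation; it assumes a simple pooled two-proportion z-test, and the conversion counts are made up for illustration.

```python
# Hedged sketch: classify a metric as statistically significant by comparing
# a two-proportion z-test p-value against the organization's threshold.
# The conversion counts below are hypothetical.
from math import sqrt, erf

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

ALPHA = 0.05  # organization-level statistical significance setting
p = two_proportion_p_value(conv_a=200, n_a=1000, conv_b=260, n_b=1000)
print(f"p-value = {p:.4f}, significant = {p <= ALPHA}")
```

If your organization has changed the default setting, `ALPHA` would change accordingly; the classification logic stays the same.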
New Data = New Information
Put simply, a metric shifts from statistically significant to inconclusive because data that arrived in the intervening period added to the total picture, and the calculations comparing the treatment to the baseline shifted to inconclusive (p-value > 0.05). This may be because of seasonality (users behave differently on Wednesdays than they do on Tuesdays), or simply because the new treatment was not influential enough to change the behavior of a new batch of users in the same way as the previous ones. Finally, it's possible that the first calculation was a false positive, and that additional data representative of the true effect of the treatment corrected that mistake.
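The Tuesday-to-Wednesday flip described above can be reproduced with a toy calculation. The counts here are invented, and the pooled two-proportion z-test is an assumption for illustration, but the pattern is the point: the cumulative result is significant after Tuesday and inconclusive once Wednesday's users are folded in.

```python
# Hedged illustration (made-up counts): the same metric, recalculated a day
# later with more data, can flip from significant to inconclusive.
from math import sqrt, erf

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Cumulative data through Tuesday: significant (p <= 0.05)
p_tue = p_value(conv_a=40, n_a=500, conv_b=60, n_b=500)
# Wednesday's new users respond less strongly; the cumulative
# result is now inconclusive (p > 0.05)
p_wed = p_value(conv_a=85, n_a=1000, conv_b=108, n_b=1000)
print(f"Tuesday p = {p_tue:.3f}, Wednesday p = {p_wed:.3f}")
```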
One thing to note about the above example is that the confidence interval for the metric, when statistically significant, is rather wide (+4.20% to +144.42%). This suggests that, although a significant p-value was calculated, the data is still somewhat noisy and uncertain.
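A wide interval like that tends to appear when the baseline rate is small and sample sizes are modest. The following is a naive sketch of how such a relative-uplift interval can arise, using a normal approximation on the difference of two conversion rates divided by the baseline rate; this is a simplification, not Split's method, and the counts are hypothetical.

```python
# Hedged sketch: a rough 95% confidence interval for relative uplift.
# Normal approximation on the rate difference, expressed relative to the
# baseline rate. Counts are made up; small baselines yield wide intervals.
from math import sqrt

def uplift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    lo, hi = diff - z * se, diff + z * se
    # Express bounds relative to the baseline rate, as percentages
    return 100 * lo / p_a, 100 * hi / p_a

lo, hi = uplift_ci(conv_a=20, n_a=1000, conv_b=35, n_b=1000)
print(f"relative uplift 95% CI: {lo:+.2f}% to {hi:+.2f}%")
```

Note how the lower bound barely clears zero while the upper bound is enormous: significant, but far from settled.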
Bottom Line: No Peeking!
The possibility of early noisy data and false positives is a key reason it is important to decide how long your experiment will run before starting it, and not to pick a winner sooner than that just because a metric reached significance. In addition to multiple decision points increasing the chance of seeing a false positive, more data gives you more confidence in the durability of the effect you are seeing. Before starting your experiment, use the sample size and sensitivity calculators to see how many users have to encounter your experiment in order to detect a meaningful swing in metrics. These calculators also take into account your seasonality cycle (typically seven days, or one week), so that your experiment's duration covers an equal number of each phase of the cycle.
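For intuition about what such a calculator does under the hood, here is a sketch of the standard two-proportion sample-size formula. The baseline conversion rate, minimum detectable effect, and power level below are assumptions, and actual calculators may use more refined methods.

```python
# Hedged sketch: users needed per treatment arm to detect a given relative
# lift, via the standard two-proportion sample-size formula.
# Baseline rate, detectable effect, and power are assumed inputs.
from math import ceil

Z_ALPHA = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84   # power = 0.80

def sample_size_per_arm(p_base, mde_rel):
    """Users per arm to detect a relative lift of mde_rel at ~80% power."""
    p_new = p_base * (1 + mde_rel)
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((Z_ALPHA + Z_BETA) ** 2 * var / (p_new - p_base) ** 2)

n = sample_size_per_arm(p_base=0.05, mde_rel=0.10)  # detect a 10% relative lift
print(f"{n} users per arm")
```

With the target number of users in hand, you would then round the planned duration up to whole weeks so the run covers complete seasonality cycles.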
If you are using the sequential testing method, you can peek at your results. With these types of tests, your results are always valid, and they won't change once they reach significance (unless there is strong seasonality). Once you turn this feature on, you can check your results as many times as you want, as soon as you start your measurements.
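To see why peeking is safe here, consider one common approach to sequential testing, the mixture sequential probability ratio test (mSPRT), which produces an "always-valid" p-value that can only shrink as data arrives. This sketch is illustrative, not Split's implementation; the observation stream and the `sigma` and `tau` parameters are assumptions.

```python
# Hedged sketch: an mSPRT-style always-valid p-value for a stream of
# per-user metric differences. Because each new p-value is the minimum of
# the old one and 1/likelihood-ratio, it never increases, so a result
# that reaches significance stays significant. sigma/tau are assumed.
from math import sqrt, exp

def msprt_p_values(diffs, sigma=1.0, tau=1.0):
    """Running always-valid p-values for a stream of observed differences."""
    p, total, n, out = 1.0, 0.0, 0, []
    v = sigma ** 2
    for d in diffs:
        total += d
        n += 1
        mean = total / n
        # Mixture likelihood ratio against the null of zero difference
        lam = sqrt(v / (v + n * tau ** 2)) * exp(
            (n ** 2 * tau ** 2 * mean ** 2) / (2 * v * (v + n * tau ** 2))
        )
        p = min(p, 1.0 / lam)  # monotone: the p-value never increases
        out.append(p)
    return out

ps = msprt_p_values([0.9, 1.1, 0.8, 1.2, 1.0, 0.95])
print([round(p, 3) for p in ps])  # non-increasing after every observation
```

Because the running p-value is monotonically non-increasing, checking it after every user costs you nothing in validity, which is exactly what makes peeking harmless with sequential tests.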