When you click into a metric card from the metrics impact tab, you can see more detailed information on the performance of the two selected treatments, as well as how that performance has changed over time. You can find more information on what is shown in the metrics details and trends view in this article.
We show the results at multiple points in time, along with the output of our statistical analyses such as the p-value and the error margin, so that you can see how the results would have looked had you checked the impact at a particular point in the past. For example, you can check how the results looked at the end of a review period, even if that review period has now passed.
Concluding on interim data
Although we show the statistical results for multiple interim points, we caution against drawing conclusions from interim data. Each interim point at which the data is analysed has its own chance of producing a false positive result, so looking at more points brings a greater chance of a false positive. You can read more about statistical significance and false positives in this article. If you were to look at all the p-values from the interim analysis points and claim a significant result whenever any of them was below your significance threshold, your false positive rate would be substantially higher than the threshold alone implies. For example, with a significance threshold of 0.05, you would have far more than a 5% chance of seeing a falsely significant result if you concluded on any significant p-value shown in the metric details and trends view. This is because you have multiple chances to happen upon a time when the natural noise in the data looks like a real impact. For this reason it is good practice to draw conclusions from your experiment only at the predetermined conclusion point(s), such as at the end of the review period. You can read more about reviewing metrics during an experiment in this article.
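The inflation described above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not the analysis used in the product: it simulates AA tests with a hypothetical conversion rate, applies a pooled two-proportion z-test to the cumulative data at each interim look, and compares how often any look is significant with how often the final look alone is significant. The function names, sample sizes and choice of test are all assumptions made for the example.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_peeking(n_experiments=1000, n_looks=10, users_per_look=100,
                        rate=0.1, alpha=0.05, seed=0):
    """Return (false positive rate if concluding on ANY look,
               false positive rate if concluding on the FINAL look only)."""
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n_a = n_b = 0
        significant_at_some_look = False
        p_value = 1.0
        for _ in range(n_looks):
            # Both treatments share the same true rate, so any "impact" is noise.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n_a += users_per_look
            n_b += users_per_look
            p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
            if p_value < alpha:
                significant_at_some_look = True
        any_look_hits += significant_at_some_look
        final_look_hits += p_value < alpha  # p_value now holds the final look
    return any_look_hits / n_experiments, final_look_hits / n_experiments


if __name__ == "__main__":
    any_rate, final_rate = simulate_aa_peeking()
    print(f"Concluding on any significant look: {any_rate:.1%}")
    print(f"Concluding on the final look only:  {final_rate:.1%}")
```

Under these assumptions, the final-look rate should sit close to the 0.05 threshold, while the any-look rate should be noticeably higher, and it grows as the number of looks increases.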
Interpreting the line chart and trends
The line chart provides a visualisation of how the measured impact has changed since the beginning of the split. This may be useful for gaining insight into any seasonality or for identifying unexpected sudden changes in the performance of the treatments.
However, it is important to remember that there will naturally be noise and variation in the data, especially when the sample size is low at the beginning of a split, so some differences in the measured impact over time are to be expected. Additionally, since the data is cumulative, the impact may be expected to change as the run time of your split increases. For example, the fraction of users who have performed an event may be expected to increase over time simply because the users have had more time to perform the action.
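As a rough illustration of the cumulative effect, the sketch below assumes a simplified model that is not taken from the product: a fixed group of users, each with the same independent chance of performing the event on any given day. Under that assumption, the fraction who have performed the event at least once by day d is 1 - (1 - p)^d, which rises over time even though user behaviour does not change. The function name and the daily probability are hypothetical.

```python
def fraction_converted_by_day(daily_probability, days):
    """Fraction of a fixed cohort that has performed the event at least once
    by the end of each day, assuming an independent, constant daily chance."""
    return [1 - (1 - daily_probability) ** day for day in range(1, days + 1)]


if __name__ == "__main__":
    for day, fraction in enumerate(fraction_converted_by_day(0.02, 14), start=1):
        print(f"day {day:2d}: {fraction:.1%}")
```

In a real split the picture is more complicated, since new users keep entering the experiment, but the same mechanism contributes to cumulative metrics drifting as run time increases.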
The image below shows the impact over time line chart for an example AA test, which is a split where there is no true difference between the performance of the treatments. Despite there being no difference between the treatments, and hence a constant true impact of zero, the line chart shows a large measured difference at the beginning and an apparent upward trend over time. This is due only to noise in the data at the early stages of the split, when the sample size is low, and to the measured impact moving towards the true value as more data arrives.
Note also that in the chart above there are three calculation buckets for which the error margin is entirely below zero, and hence the p-values at those points in time would imply a statistically significant impact. This is again due to noise and the unavoidable chance of false positive results. If you weren't aware of the risk of 'peeking' at the data, or of considering multiple evaluations of your split at different points in time, you might have concluded that a meaningful impact had been detected. However, by following the recommended practice of concluding only at the predetermined end time of your split, you would eventually have seen a statistically inconclusive result, as expected for an AA test.
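To see why an AA test can produce a chart like this, the sketch below generates the cumulative measured impact and an approximate 95% error margin for a single simulated AA test, one row per calculation bucket. It is a minimal illustration under assumptions: the margin uses a simple normal approximation for a difference in proportions, which may differ from the calculation used in the product, and the function name, bucket sizes and conversion rate are hypothetical.

```python
import math
import random


def cumulative_impact_trace(n_buckets=20, users_per_bucket=500, rate=0.1, seed=1):
    """Return a list of (bucket, measured_impact, error_margin) for one AA test.

    Both treatments share the same true rate, so the true impact is zero
    and any measured impact is noise.
    """
    rng = random.Random(seed)
    conv_a = conv_b = n = 0
    trace = []
    for bucket in range(1, n_buckets + 1):
        conv_a += sum(rng.random() < rate for _ in range(users_per_bucket))
        conv_b += sum(rng.random() < rate for _ in range(users_per_bucket))
        n += users_per_bucket  # cumulative users per treatment
        p_a, p_b = conv_a / n, conv_b / n
        impact = p_b - p_a
        # Assumed 95% margin: 1.96 standard errors of the difference in proportions.
        margin = 1.96 * math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
        trace.append((bucket, impact, margin))
    return trace


if __name__ == "__main__":
    for bucket, impact, margin in cumulative_impact_trace():
        # A bucket "looks significant" when the interval excludes zero.
        flag = "  <- interval excludes zero" if abs(impact) > margin else ""
        print(f"bucket {bucket:2d}: impact {impact:+.4f} +/- {margin:.4f}{flag}")
```

The error margin is widest in the first buckets and narrows as data accumulates, so early measured impacts can sit far from zero. Running this with different seeds shows that some AA tests have interim buckets whose interval excludes zero purely by chance.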