Alert Policies are currently in beta
Please contact email@example.com if you would like more information or if you would like early access.
For alert policies, we do not test for statistically significant evidence of any impact, as we do for our standard metric analyses. Instead, we test for significant evidence of an impact larger than your chosen degradation threshold, in the direction opposite to the metric’s desired direction.
This means that, by design, an observed impact equal to the set threshold will not fire an alert. An alert fires only when the entire confidence interval, which represents the range of likely values for the impact, lies beyond your set threshold (above or below it, depending on the metric’s desired direction). It is therefore not unexpected to see a degradation in a metric that is larger than your set threshold without an alert firing. This simply means that, statistically, the result could be due to noise in the data rather than a real degradation.
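The firing rule described above can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function name `alert_fires`, its parameters, and the convention of expressing the impact and threshold as relative fractions (0.10 for 10%) are all assumptions made for this example.

```python
def alert_fires(ci_lower, ci_upper, threshold, desired_direction):
    """Return True if the whole confidence interval lies beyond the threshold.

    Hypothetical helper for illustration. ci_lower and ci_upper bound the
    estimated relative impact (comparison vs. baseline); threshold is the
    degradation threshold as a positive fraction, e.g. 0.10 for 10%.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")

    if desired_direction == "decrease":
        # A degradation is an increase, so the entire interval
        # must sit strictly above +threshold.
        return ci_lower > threshold
    if desired_direction == "increase":
        # A degradation is a decrease, so the entire interval
        # must sit strictly below -threshold.
        return ci_upper < -threshold
    raise ValueError("desired_direction must be 'increase' or 'decrease'")


# Observed impact (interval midpoint) exceeds 10%, but the interval
# still overlaps the threshold, so no alert fires.
wide_interval = alert_fires(0.02, 0.25, 0.10, "decrease")

# Same direction, but the whole interval is now above 10%.
narrow_interval = alert_fires(0.11, 0.20, 0.10, "decrease")
```

With these inputs, `wide_interval` is `False` and `narrow_interval` is `True`, which mirrors the case where the observed degradation is past the threshold but the interval is still too wide to alert.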
For example, if the results were as shown in the image below, an alert would not have fired for the first three checks, even though the observed impact is already above your set alert threshold after Check 2. No alert fires in these earlier checks because the error margin, or confidence interval, on the impact is too wide to be confident that the impact really is greater than your threshold. For the fourth and fifth checks, however, an alert would fire.
Hence, for an alert to fire, the observed degradation will need to be a certain amount more extreme than the threshold you’ve chosen. Exactly how much more extreme it would need to be (sometimes called the Minimum Detectable Effect) depends on the power of the metric, which is influenced primarily by sample size and the variance in the metric values.
The attached sheet can be used to help you calculate what range of degradations you can expect to detect for a given sample size and set of metric characteristics (currently this sheet only supports means, or 'real valued', metrics and not proportions metrics).
For example, imagine you have a ‘Percentage of Unique Users’ metric which has a value of 60% in the baseline treatment, and you use a degradation threshold of 10%. If the desired direction of the metric is a decrease, then we would be testing for evidence that the Percentage of Unique Users in the comparison group is more than 66%, that is, more than 10% higher in relative terms than the baseline value of 60%.
Assuming a 50/50 percentage rollout of users between baseline and comparison treatments, and an Org-wide significance level of 0.05, with 10,000 unique users you would only see an alert if the observed percentage for the comparison group were higher than 68%. If instead you had 1,000 or 100,000 unique users, the comparison group value would need to be higher than 73% or 66.7%, respectively, for an alert to be raised.
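To get a feel for how sample size drives these figures, the following Python sketch approximates the smallest comparison-group proportion that would trigger an alert. It is a rough illustration under stated assumptions, not the calculation the product performs: it assumes a normal approximation for the difference of two proportions, treats the 0.05 significance level as two-sided, and treats the 66% boundary as fixed. The function name `min_alerting_proportion` and its parameters are hypothetical. Because the exact test is not specified here, the results should be close to, but may not exactly match, the figures quoted above.

```python
from math import sqrt
from statistics import NormalDist


def min_alerting_proportion(baseline, relative_threshold, total_users,
                            alpha=0.05, split=0.5, two_sided=True):
    """Approximate the smallest comparison proportion that fires an alert.

    Hypothetical helper for illustration, using a normal approximation.
    baseline: proportion in the baseline treatment, e.g. 0.60
    relative_threshold: degradation threshold as a fraction, e.g. 0.10
    total_users: unique users across both treatments
    split: fraction of users in the baseline treatment
    """
    n_base = total_users * split
    n_comp = total_users * (1 - split)
    boundary = baseline * (1 + relative_threshold)  # e.g. 0.60 * 1.10 = 0.66

    # Critical value for the assumed significance level.
    tail = alpha / 2 if two_sided else alpha
    z = NormalDist().inv_cdf(1 - tail)

    # The margin depends on the comparison proportion itself, so solve
    # p = boundary + z * se(p) by simple fixed-point iteration.
    p = boundary
    for _ in range(50):
        se = sqrt(baseline * (1 - baseline) / n_base + p * (1 - p) / n_comp)
        p = min(boundary + z * se, 1.0)
    return p


for users in (1_000, 10_000, 100_000):
    needed = min_alerting_proportion(0.60, 0.10, users)
    print(f"{users:>7} users: comparison value must exceed about {needed:.1%}")
```

The pattern matters more than the exact numbers: as the number of unique users grows, the required observed value moves closer to the 66% boundary but never reaches it.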
Hence, we recommend setting an alert threshold that is less extreme than any degradation for which you would definitely want to be alerted. Choose a threshold that is close to the boundary between a safe or acceptable degradation and a degradation that you would want to know about.