Whether you are releasing new functionality or running an experiment, Split is constantly analyzing the change in your customer metrics to determine whether the impact is statistically conclusive and not simply happening by chance.
By configuring your organization's statistical settings, you can set organization-wide defaults such as the significance threshold, which controls your chances of seeing false positive results.
Monitor Window Settings
Split allows you to configure how long your metrics are monitored and alerts you if a severe degradation occurs. By default, the monitoring window is set to 24 hours from a split version change. You can select from a range of monitoring windows, from 30 minutes to 14 days.
With configurable monitoring windows, you can tailor the monitoring period to your team's release strategy. For example, set your monitoring window to 24 hours if you are turning on a feature at night with low traffic volumes and want to monitor through the morning as traffic increases, or to 30 minutes if you expect high traffic volumes within the first 30 minutes of a new split version. Find out more about choosing your degradation threshold based on your expected traffic here.
Statistical Approach used for Monitoring Window
For alert policies, rather than testing for statistically significant evidence of any impact as we do for our standard metric analyses, we test for significant evidence of an impact larger than your chosen degradation threshold, in the opposite direction to the metric’s desired direction.
In order to control the false positive rate during the monitoring window we adjust the significance threshold that the p-value must meet before an alert is fired. We divide the threshold by the number of times we will check for degradations during the selected monitoring window. For example, if your monitoring window is 30 minutes, we estimate that we will run 5 calculations during that time. In this case, if your significance threshold is set to 0.05 in your statistical settings, the p-value would need to be below 0.01 (0.05 / 5) for an alert to fire in this time window.
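The division described above can be sketched as a simple Bonferroni-style correction. A minimal sketch, assuming the check count from the example (5 calculations for a 30-minute window) purely for illustration; Split's actual calculation schedule may differ:

```python
def adjusted_alert_threshold(significance_threshold: float, num_checks: int) -> float:
    """Bonferroni-style correction: divide the overall significance
    threshold by the number of degradation checks planned during
    the monitoring window, so the chance of any false alert across
    the whole window stays at the original threshold."""
    return significance_threshold / num_checks

# Example from the text: a 30-minute window with ~5 checks.
alpha = adjusted_alert_threshold(0.05, 5)
print(alpha)  # an alert fires only when the p-value falls below this value
```

A longer window means more checks and thus a smaller per-check threshold, which is why longer windows are slightly less sensitive to small degradations early on.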
This adjustment allows us to control the false positive rate and ensure that the likelihood of getting a false alert, across the whole of the monitoring window, is no higher than your chosen significance threshold. The level of adjustment is dependent on the duration of the monitoring window and how many calculations will run during that time.
This adjustment means that a longer monitoring window will have slightly less ability to detect small degradations at the beginning of your release or rollout, but in most cases this will be far outweighed by the gain in sensitivity due to the larger sample size you accrue over a longer window.
Significance threshold
The significance threshold represents your organization's risk tolerance. Formally, the significance threshold is the probability of a false positive result when there is no real difference between the treatments.
A commonly used value for the significance threshold is 0.05 (5%), which means that when there is no real difference between the performance of the treatments, there is a 5% chance of observing a statistically significant impact (a false positive). In statistical terms, the significance threshold is equivalent to alpha (α).
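You can see what a 5% false positive rate means in practice by simulating many A/A comparisons, where both "treatments" are drawn from the same distribution, and counting how often a test reaches significance purely by chance. The sketch below uses a two-proportion z-test and made-up sample sizes and conversion rates as illustrative assumptions:

```python
import math
import random

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test with a pooled variance estimate."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

random.seed(42)
ALPHA, TRIALS, N, TRUE_RATE = 0.05, 2000, 1000, 0.3

false_positives = 0
for _ in range(TRIALS):
    # Both arms share the same true rate: any "significant" result is false.
    a = sum(random.random() < TRUE_RATE for _ in range(N))
    b = sum(random.random() < TRUE_RATE for _ in range(N))
    if two_proportion_p_value(a, N, b, N) < ALPHA:
        false_positives += 1

rate = false_positives / TRIALS
print(rate)  # close to the 0.05 significance threshold
```

The observed false positive rate hovers near the chosen alpha, which is exactly the behavior the significance threshold setting controls.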
Power
Power measures an experiment's ability to detect an effect when one exists. Formally, the power of an experiment is the probability of correctly rejecting a false null hypothesis.
A commonly used value for statistical power is 80%, which means the metric has an 80% chance of reaching significance if the true impact is equal to the minimum likely detectable effect. All else being equal, a higher power increases the recommended sample size needed for your split. In statistical terms, the power threshold is equivalent to 1 - β.
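Together, the significance threshold and power determine the recommended sample size. A standard normal-approximation sketch for a two-proportion comparison, with made-up baseline rate and minimum detectable effect as inputs (this is a textbook formula, not necessarily Split's exact sizing calculation):

```python
import math
from statistics import NormalDist

def sample_size_per_treatment(alpha, power, baseline_rate, minimum_effect):
    """Approximate sample size per treatment for a two-sided
    two-proportion test, using the normal approximation."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / minimum_effect ** 2
    return math.ceil(n)

# e.g. 10% baseline conversion rate, detect a 2-point absolute lift
n = sample_size_per_treatment(alpha=0.05, power=0.80,
                              baseline_rate=0.10, minimum_effect=0.02)
print(n)
```

Raising power (or lowering alpha) increases the returned sample size, matching the trade-off described above.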
Experimental review period
The experimental review period is the length of time over which a typical customer visits your product and completes the activities relevant to your metrics. For instance, if customer behavior differs over the course of the week or on weekends, set at least a seven day period.
A commonly used experimental review period is at least 14 days, to account for the weekly and weekend behavior of customers. Adjust the review period to the most appropriate option for your business; you can select 1, 7, 14, or 28 days.
Multiple Comparison Corrections
Analyzing multiple metrics per experiment can substantially increase your chances of seeing a false positive result if not accounted for. Our multiple comparison corrections feature applies a correction to your results so that the overall chance of a significant metric being a false positive will never be larger than your significance threshold. For example, with the default significance threshold of 5%, you can be confident that at least 95% of all of your statistically significant metrics reflect real, meaningful impacts. This guarantee applies regardless of how many metrics you have.
With this setting applied, the significance of your metrics, and their p-values and error margins, will automatically be adjusted to include this correction. This correction will be immediately applied to all tests, including previously completed ones. Learn more about our multiple comparison corrections here.
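The guarantee described above, that the chance of a significant metric being a false positive stays below your threshold, is the kind of guarantee provided by false-discovery-rate procedures such as Benjamini-Hochberg. The source doesn't specify Split's exact correction, so the following is a minimal sketch assuming BH purely for illustration:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of metrics that remain significant after a
    Benjamini-Hochberg false-discovery-rate correction at level q."""
    m = len(p_values)
    # Sort p-values ascending, remembering each metric's original index.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q,
    # then reject the k smallest p-values.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# Four metrics' raw p-values; with q = 0.05 the first three stay significant.
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.2], q=0.05))  # → [0, 1, 2]
```

Note that a raw p-value below 0.05 (here, all of the first three) can still survive or fall depending on how many metrics are tested, which is why corrected results can differ from naive per-metric significance.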
Recommendations and trade-offs
Be aware of the trade-offs associated with changing the statistical settings for your organization. In general, a lower significance threshold increases the number of samples required to achieve significance. Raising this setting decreases the number of samples and the amount of time needed to declare significance, but also increases the chance that some of the results are false positives.
As best practice, we recommend setting your significance threshold to between 0.01 and 0.1. In addition, we recommend an experimental review period of at least 14 days to account for weekly use patterns.
Navigate to Admin Settings > Statistical Settings. After you adjust your settings, click Save.
Changing your statistical settings instantly affects your entire organization and all analyses. For example, if your experiment shows metrics with a statistically significant positive impact at a 0.05 significance threshold and you change your significance threshold from 0.05 to 0.01, the next time you load your metrics impact page those metrics may no longer be marked as having a significant impact.
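The reason results can flip is that a metric's p-value itself doesn't change; only the bar it is compared against does. A tiny illustration, with a made-up p-value:

```python
p_value = 0.03  # hypothetical metric p-value, unchanged by your settings

print(p_value < 0.05)  # True:  significant at the old threshold
print(p_value < 0.01)  # False: no longer significant at the new one
```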