Whether you are releasing new functionality or running an experiment, Split is constantly analyzing the change in your customer metrics to determine whether the impact is statistically conclusive and not simply happening by chance.
By configuring your organization's statistical settings, you can set the organization-wide default significance threshold at which Split marks a metric's impact as significant, and the default power threshold at which Split marks a metric as having received enough samples.
Significance threshold is a representation of your organization's risk tolerance. Formally, the significance threshold is the probability of detecting a false positive.
A commonly used value for the significance threshold is 0.05 (5%), which means that when there is no real difference between the performance of the treatments, there is a 5% chance of observing a statistically significant impact (a false positive). In statistical terms, the significance threshold is equivalent to alpha (α).
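To make the 5% figure concrete, the following is a minimal simulation sketch, not a description of Split's internal test. It assumes a two-sided z-test on normally distributed data with a known standard deviation of 1, and the function name `false_positive_rate` and its parameters are hypothetical. Both treatments are drawn from the same distribution, so every significant result is a false positive.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(alpha=0.05, n_per_group=100, n_experiments=2000, seed=7):
    """Simulate experiments with no real difference and return the share flagged significant."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference in means when both groups have sigma = 1 (assumed known).
    std_error = (2 / n_per_group) ** 0.5
    false_positives = 0
    for _ in range(n_experiments):
        # Both treatments come from the same distribution: the true impact is zero.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(a) - fmean(b)) / std_error
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_experiments


rate = false_positive_rate()
print(f"Observed false positive rate: {rate:.3f}")
```

The observed rate should land close to the chosen significance threshold, which is what the definition of alpha predicts.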
Significance threshold for Monitoring
When monitoring your alert policies to detect any significant degradations, we conduct multiple statistical tests during the monitoring window in order to minimize the time it takes for an alert to fire. To ensure there is not an inflated false positive rate due to this repeated testing, we adjust the significance threshold used for each individual test in a way which ensures the overall probability of detecting a false positive is not higher than your chosen significance threshold.
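The source does not state which adjustment is applied, so the sketch below uses the Bonferroni correction purely as an illustration of the idea. The function names and the choice of 10 tests per monitoring window are assumptions. Dividing the overall threshold by the number of tests guarantees that the combined false positive probability cannot exceed the overall threshold.

```python
def per_test_threshold(overall_alpha, n_tests):
    """Bonferroni-adjusted threshold for each individual test (illustrative, not Split's method)."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return overall_alpha / n_tests


def worst_case_overall_rate(per_test_alpha, n_tests):
    """Upper bound (union bound) on the chance of at least one false positive."""
    return min(1.0, per_test_alpha * n_tests)


def unadjusted_overall_rate(alpha, n_tests):
    """Chance of at least one false positive if each independent test used the full alpha."""
    return 1 - (1 - alpha) ** n_tests


overall_alpha = 0.05
n_tests = 10  # hypothetical number of tests in one monitoring window

adjusted = per_test_threshold(overall_alpha, n_tests)
bound = worst_case_overall_rate(adjusted, n_tests)
inflated = unadjusted_overall_rate(overall_alpha, n_tests)

print(f"Per-test threshold: {adjusted:.4f}")
print(f"Overall false positive bound with adjustment: {bound:.4f}")
print(f"Overall false positive rate without adjustment: {inflated:.4f}")
```

Without any adjustment, ten independent tests at 0.05 each would produce at least one false positive roughly 40% of the time, which is the inflation the adjustment is meant to prevent.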
Power measures an experiment's ability to detect an effect when one exists. Formally, the power of an experiment is the probability of rejecting a false null hypothesis.
A commonly used value for statistical power is 80%, which means that the metric has an 80% chance of reaching significance if the true impact is equal to the minimum likely detectable effect. Assuming all else is equal, a higher power will increase the recommended sample size needed for your split. In statistical terms, the power threshold is equivalent to 1 - β.
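The relationship between power, significance threshold, and sample size can be sketched with the standard normal-approximation formula for a two-sided comparison of two equally sized groups. This is a textbook approximation, not necessarily the calculation Split performs. The function name `samples_per_treatment`, the standard deviation of 1.0, and the effect size of 0.1 are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def samples_per_treatment(alpha, power, std_dev, min_effect):
    """Approximate samples needed per treatment for a two-sided, two-sample z-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the significance threshold
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power (1 - beta)
    return math.ceil(2 * ((z_alpha + z_power) * std_dev / min_effect) ** 2)


# Hypothetical metric: standard deviation 1.0, smallest effect worth detecting 0.1.
n_power_80 = samples_per_treatment(alpha=0.05, power=0.80, std_dev=1.0, min_effect=0.1)
n_power_95 = samples_per_treatment(alpha=0.05, power=0.95, std_dev=1.0, min_effect=0.1)
n_alpha_01 = samples_per_treatment(alpha=0.01, power=0.80, std_dev=1.0, min_effect=0.1)

print(f"alpha=0.05, power=80%: {n_power_80} samples per treatment")
print(f"alpha=0.05, power=95%: {n_power_95} samples per treatment")
print(f"alpha=0.01, power=80%: {n_alpha_01} samples per treatment")
```

Under these assumptions, raising the power from 80% to 95%, or tightening the significance threshold from 0.05 to 0.01, both increase the number of samples required.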
Experimental review period
The experimental review period represents the length of time in which a typical customer visits the product and completes the activities relevant to your metrics. For instance, your customers may behave differently on weekdays than on weekends (set a seven-day period), or you may have a 30-day sales cycle (set a four-week period).
A commonly used value for the experimental review period is at least 14 days, to account for weekday and weekend customer behavior. Adjust the review period to the most appropriate option for your business, between 7 and 30 days.
Recommendations and trade-offs
Be aware of the trade-offs associated with changing the statistical settings for your organization. In general, a stricter (numerically lower) significance threshold and a higher power threshold increase the number of samples required to achieve significance. Relaxing these settings decreases the number of samples and the amount of time needed to declare significance, but comes at a cost: a higher significance threshold increases the chance that some of the results are false positives, and a lower power threshold increases the chance that a real impact goes undetected.
As a best practice, we recommend setting your significance threshold to between 0.01 and 0.1 and your power threshold to between 80% and 95%. In addition, we recommend an experimental review period of at least 14 days to account for weekly use patterns.
Navigate to Admin Settings > Statistical Settings. After you adjust your settings, click Save.
Changing your statistical settings instantly affects your entire organization and all analysis. If your experiment is showing metrics as having a statistically significant positive impact at a 0.05 significance threshold, and you change your significance threshold from 0.05 to 0.01, the next time you load your metrics impact page you may see that those metrics are no longer marked as having a significant impact.
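As a small illustration of why a result can lose its significant status, the sketch below assumes that a metric is marked significant when its p-value falls below the threshold. The function name `is_significant` and the p-value of 0.03 are hypothetical.

```python
def is_significant(p_value, significance_threshold):
    """Return True when the p-value falls below the configured threshold (assumed decision rule)."""
    return p_value < significance_threshold


p_value = 0.03  # hypothetical p-value for one metric

for threshold in (0.05, 0.01):
    status = "significant" if is_significant(p_value, threshold) else "not significant"
    print(f"Threshold {threshold}: {status}")
```

The underlying data has not changed between the two checks; only the bar for declaring significance has moved.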