Whether you are releasing new functionality or running an experiment, Split analyzes the change in your metrics. This helps you determine whether the impact is positive and statistically significant, rather than simply due to chance. By configuring your statistical settings, you can ensure the analyses are run in the way that best suits your use cases. For example, you can adjust the significance threshold, which controls your chances of seeing false positive results: set it higher for a faster time to significance, or lower for more reliable results, reflecting your preferred balance between confidence and speed.
There are two types of statistical settings:
- Monitor settings, which impact how your alert policies are triggered
- Experiment settings, which impact how your metric impact results are analyzed
You can set organization-wide defaults by configuring your organization's statistical settings in the Admin Settings section. These defaults are used to trigger all alert policies, are applied to all newly created splits, and govern any splits whose settings you have not customized. You can also customize the experiment settings of individual splits; the settings used for a particular environment's results within a particular split can be customized independently.
When you customize your experiment settings at the split level, the new settings are immediately applied to all versions of the split, but only in the environment you customized. For example, you could set a particular split to use different experiment settings in the staging and production environments. Customizing experiment settings at the split level has no impact on your other splits or on any of your alert policies.
Monitor settings can only be configured at the organizational level. This is to ensure each alert policy is always analyzed against the same statistical settings, maintaining consistency across any alerts that may be raised.
Monitoring window
Split allows you to configure how long your metrics are monitored and alerts you if a severe degradation occurs. By default, the monitoring window is set to 24 hours from a split version change. You can select from a range of monitoring windows, from 30 minutes to 28 days.
Configurable monitoring windows let you tailor your monitoring period to your team's release strategy. For example, set your monitoring window to 24 hours if you are turning on a feature at night with low traffic volumes and want to monitor through the morning as traffic increases, or to 30 minutes if you expect high traffic volumes within the first 30 minutes of a new split version. To learn about choosing your degradation threshold based on your expected traffic, refer to Choosing your degradation threshold for alerting.
Monitor significance threshold
The monitor significance threshold limits your chances of receiving a false alert. A lower significance threshold means we wait until there is more evidence of a degradation before firing an alert. Therefore, a lower significance threshold reduces the chance of false alerts, but this comes at the cost of increasing the time it takes for an alert to fire when a degradation does exist.
A commonly used value for the monitor significance threshold is 0.05 (5%), which means that, for each alert policy and for each version update, there is at most a 5% chance of seeing an alert when the true difference between the treatments is less than the degradation threshold set up in your metric’s alert policy.
You can configure the monitor significance threshold independently from the default significance threshold used for calculating your metric results. Changing this setting only impacts your monitoring alerts and not the metric results.
Statistical approach used for monitoring window
For alert policies, rather than testing for statistically significant evidence of any impact as we do for our standard metric analyses, we test for significant evidence of an impact larger than your chosen degradation threshold, in the opposite direction to the metric’s desired direction.
In order to control the false positive rate during the monitoring window we adjust the significance threshold that the p-value must meet before an alert is fired. We divide the threshold by the number of times we will check for degradations during the selected monitoring window. For example, if your monitoring window is 30 minutes, we estimate that we will run 5 calculations during that time. In this case, if your monitor significance threshold is set to 0.05 in your statistical settings, the p-value would need to be below 0.01 (0.05 / 5) for an alert to fire in this time window.
This adjustment allows us to control the false positive rate and ensure that the likelihood of getting a false alert, across the whole of the monitoring window, is no higher than your chosen monitor significance threshold. The level of adjustment is dependent on the duration of the monitoring window and how many calculations will run during that time.
This adjustment means that a longer monitoring window will have slightly less ability to detect small degradations at the beginning of your release or rollout, but in most cases this will be far outweighed by the gain in sensitivity due to the larger sample size you accrue over a longer window.
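The threshold adjustment described above can be sketched in a few lines of Python; the function name is illustrative, and the estimate of five calculations per 30-minute window is taken from the example:

```python
def adjusted_threshold(monitor_significance_threshold, num_checks):
    """Bonferroni-style adjustment: divide the monitor significance
    threshold by the number of degradation checks expected during the
    monitoring window, so the overall false-alert rate stays bounded."""
    return monitor_significance_threshold / num_checks

# 30-minute window with an estimated 5 calculations during that time:
print(round(adjusted_threshold(0.05, 5), 6))  # 0.01
```

A longer window runs more checks, so each individual check uses a stricter p-value cutoff.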
Default significance threshold
The significance threshold is a representation of your organization's risk tolerance. Formally, the significance threshold is the probability of a given metric calculation returning a statistically significant result when the null hypothesis is true (i.e. when there is no real difference between the treatments for that metric).
A higher significance threshold will allow you to reach statistical significance faster when a true difference does exist, but it will also increase your chances of seeing a false positive when no true difference exists. Conversely, a lower significance threshold will reduce your chances of seeing false positive results but you will need a larger difference between the two treatments, or a larger sample size, in order to reach statistical significance.
A commonly used value for the significance threshold is 0.05 (5%). With this threshold, a given calculation of a metric with no true impact has a 5% chance of showing as statistically significant (i.e., a false positive). If multiple comparison corrections have been applied, this instead means there is at most a 5% chance of a statistically significant metric being a false positive.
Minimum sample size
The minimum number of samples required in each treatment before we calculate statistical results for your metrics. This number must be at least 10; for most situations we recommend using a minimum sample size of 355.
For the t-test used in Split's statistics to be reliable, the data must follow an approximately normal distribution. The central limit theorem (CLT) shows that the mean of a variable has an approximately normal distribution if the sample size is large enough.
You can reduce the default minimum sample size of 355 if you need results for smaller sample sizes. However, for metrics with skewed distributions, your results may be less reliable at small sample sizes.
Note that this parameter does not affect your monitoring alerts. For monitoring we always require a minimum sample size of 355 in each treatment before we will fire an alert.
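A small simulation (illustrative only, not Split's implementation) shows the central limit theorem at work: the mean of even a heavily skewed metric becomes much more tightly distributed once samples reach a few hundred observations:

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    # A heavily skewed per-user metric (exponential distribution, true mean 1.0).
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Means of larger samples cluster far more tightly around the true mean,
# which is what makes the t-test reliable at sample sizes like 355.
spread_small = statistics.stdev(sample_mean(10) for _ in range(200))
spread_large = statistics.stdev(sample_mean(355) for _ in range(200))
print(spread_small > spread_large)  # True
```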
Power threshold
Power measures an experiment's ability to detect an effect when one truly exists. Formally, the power of an experiment is the probability of rejecting the null hypothesis when it is false.
A commonly used value for statistical power is 80%, which means the metric has an 80% chance of reaching significance if the true impact is equal to the minimum likely detectable effect. All else being equal, a higher power increases the recommended sample size needed for your split. In statistical terms, the power threshold is equivalent to 1 - β.
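To illustrate the relationship between power and sample size, here is a textbook approximation for the per-treatment sample size of a two-sided test on a difference in means; this is a standard formula for illustration, not necessarily Split's exact calculation:

```python
from statistics import NormalDist

def required_sample_size(alpha, power, effect_size):
    """Approximate per-treatment sample size for a two-sided test of a
    difference in means, where effect_size is the minimum detectable
    effect in units of the metric's standard deviation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)  # power = 1 - beta
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# All else being equal, raising power from 80% to 90% raises the
# required sample size per treatment:
print(required_sample_size(0.05, 0.80, 0.2) < required_sample_size(0.05, 0.90, 0.2))  # True
```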
Experimental review period
The experimental review period represents a period of time over which a typical customer visits the product and completes the activities relevant to your metrics. For instance, if customer behavior differs during the week and on weekends, set a seven-day period.
A commonly used value for the experimental review period is at least 14 days, to account for the weekend and weekly behavior of customers. Adjust the review period to the most appropriate option for your business; you can select 1, 7, 14, or 28 days.
Multiple comparison corrections
Analyzing multiple metrics per experiment can substantially increase your chances of seeing a false positive result if not accounted for. Our multiple comparison corrections feature applies a correction to your results so that the overall chance of a significant metric being a false positive is never larger than your significance threshold. For example, with the default significance threshold of 5%, you can be confident that at least 95% of your statistically significant metrics reflect real impacts. This guarantee applies regardless of how many metrics you have.
With this setting applied, the significance of your metrics, and their p-values and error margins, will automatically be adjusted to include this correction. This correction will be immediately applied to all tests, including previously completed ones. Refer to Multiple comparison corrections guide to learn more.
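One widely used correction of this kind is the Benjamini-Hochberg procedure, which controls the false discovery rate; it is sketched below for illustration, and Split's exact correction method may differ:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: returns a True/False flag
    per p-value, controlling the expected share of false positives
    among all significant results (the false discovery rate) at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is under its stepped threshold.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    # All metrics at or below that rank are declared significant.
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            significant[i] = True
    return significant

print(benjamini_hochberg([0.001, 0.04, 0.30, 0.006]))
# [True, False, False, True]
```

Note how 0.04 would pass a naive 0.05 cutoff but fails once the correction accounts for four metrics being tested at once.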
Recommendations and trade-offs
Be aware of the trade-offs associated with changing the statistical settings. In general, a lower significance threshold increases the number of samples required to achieve significance. A higher significance threshold decreases the number of samples and the amount of time needed to declare significance, but may also increase the chance that some of the results are false positives.
As best practice, we recommend setting your significance threshold to between 0.01 and 0.1. In addition, we recommend an experimental review period of at least 14 days to account for weekly use patterns.
To configure your organization-wide settings, navigate to Admin Settings > Monitor window and statistics. After you adjust your settings, click Save.
Changing your statistical settings instantly affects your entire organization. All alert policies will be analyzed against the new settings in the future. All splits that are not customized (i.e., those with the "Always use organization wide settings" checkbox checked in their Experiment Settings) will also be analyzed using the new settings.
For example, if your experiment shows metrics as having a statistically significant positive impact at a 0.05 significance threshold, and you change your significance threshold from 0.05 to 0.01, the next time you load your Metrics impact page you may see that those metrics are no longer marked as having a significant impact.
If you want to customize the settings of a split you're viewing, navigate to Metrics impact > Experiment settings. Make sure that you have the right environment selected, or change the selection in the environment drop-down at the top in order to customize a different environment.
To customize the settings, make sure the "Always use organization wide settings" checkbox isn't selected; otherwise, the organization-wide settings are used to analyze results.
After you adjust your settings, click Save.
Changing the experiment settings for a split instantly affects all versions of that split in the environment for which you customized the settings. Versions in other environments, other splits in your organization, and all alert policies aren't affected by your customization.
Customizations apply to specific environments; if your split has multiple environments each must be customized separately.
When the "Always use organization wide settings" checkbox is selected, the settings for the split update whenever your organization-wide settings change. When it isn't selected, the split's settings no longer reflect any changes to your organization-wide settings.
Note: This feature is in beta. If you'd like to be included in the beta, contact email@example.com.
Split’s Dimensional Analysis allows you to dissect your experimentation data at a granular level, enabling better informed future hypotheses or experiments. By leveraging your event property data across all sources, you can create a set of dimensions that are used to gain additional context within your experimentation data.
This feature provides insight into the impacts on your key metrics and ways to learn from the results of your experiments, allowing you to iterate by understanding what underlying behavior could be driving your top-line metrics. Once you have that understanding, you can decide what actions to take for upcoming hypotheses or additional experiments, and review the underlying trends behind expected or unexpected behavior.
Note: To use this feature, you must enable the experimentation package. Contact firstname.lastname@example.org for more information.
Before you start
Before using this feature, you may need to take extra steps to set up your workflow and ensure your data is valid: your event properties and values must match those in Data hub, or your data may not be calculated properly. To verify, go to Data hub and, from the Live tail tab, run a query for your events to get the right event properties. For more information on how to run a query, refer to the Query events section of the Live tail guide.
What is a dimension
Dimensions are parameters, attributes, and characteristics of your data (e.g., a user, event, or product) that provide context, such as groups of users or categories of products.
How it works
Split leverages your event property data across all sources to develop a set of dimensions to break down your data. Even if you send event data from your application and another source (e.g., Segment or S3), as long as your event and property naming is consistent, Split handles the attribution to calculate your metrics appropriately.
You can configure which event properties you want to set as a dimension for your organization. For each dimension, you can select an event property and set the values Split is going to review and attribute accordingly. Once you configure these dimensions, Split periodically reviews event data streams, identifies any unique property values for those dimensions specified, and calculates your metrics based upon attributed activity to these unique event property values.
Configuring your dimensions and values
You can configure dimensions at the Admin level. You have a limit of five dimensions per organization and five values per dimension. To configure your dimensions, do the following:
- From the left hand navigation, click the icon at the top and select Admin settings.
- Under Experimentation settings, click Metric dimensions. The Metric dimensions table page appears.
- Click the Add dimension button. The Add a dimension area appears.
- Select the desired event property. We recommend using simple categories (e.g., device types and browsers) or binary variables (e.g., true or false).
Important: Make sure your event property and values match the ones in Data hub or your data may not be calculated properly. Also be aware that event property values are case-sensitive. For example, Chrome and chrome are different values.
- Either select or enter the event property values you want to use for this dimension. You can have up to five property values to calculate and graph in the user interface.
- Click the Add button to complete your dimension configuration.
- Values not selected are still included in the overall calculation for all metrics, but are not eligible for more granular dimensional analysis.
- Only values from the last 90 days are included in the dimensionality admin menu. These dimensions are calculated for key metrics only.
- If your organization needs a higher limit, contact email@example.com.
View your dimensions in the Impact snapshot graph. This graph provides you an up-to-date, aggregated view of the expected impact over baseline for each treatment and an estimated range for that impact. For more information about the graph, refer to the Line chart section of the Metric details and trends guide.
To delete a dimension, do the following:
- From the Metrics dimensions table, click Delete on the dimension you want to delete. The Delete dimension view appears.
- Type DELETE in the field and click the Delete button. The dimension is deleted and the dimension list updates.
This deletes the connection between the property and the grouping that you defined with the dimension, not any event properties.
Example use case for dimensional analysis
Let’s say you have an A/B test for a new checkout flow on an e-commerce site. After running the experiment, the conversion rate stays flat between the test and control groups. When you segment users by device, you may find that desktop users convert at a higher rate in the new flow, whereas mobile users do not respond well to it. Because you do not intend to roll back to the old flow, you roll out the new flow and use this analysis to iterate on a more optimized flow for mobile users.
Attributing events to dimensions
We attribute events to the last seen dimension property. For example, let’s assume a customer has defined a dimension "country" with two values, "USA" and "Canada", and one of its users has the following sequence of attributed events:
For a metric tracking checkout events, we attribute the second and third checkouts to USA and the fifth to Canada. However, we won't attribute the first checkout to any dimension slice because there are no prior events with this dimension property. We also won't attribute the fourth checkout to any dimension slice because the tracked dimension values do not include Mexico.
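The attribution rule can be sketched as follows; the event structure and the exact sequence of events are hypothetical, chosen to be consistent with the example above:

```python
def attribute_checkouts(events, tracked_values):
    """Attribute each checkout to the most recently seen value of the
    'country' dimension property; values outside tracked_values (or no
    prior property at all) yield no dimension slice (None)."""
    last_seen = None  # no dimension property seen yet
    slices = []
    for event in events:
        if "country" in event.get("properties", {}):
            last_seen = event["properties"]["country"]
        if event["name"] == "checkout":
            slices.append(last_seen if last_seen in tracked_values else None)
    return slices

# Hypothetical event sequence matching the example:
events = [
    {"name": "checkout", "properties": {}},                    # 1st: no prior property
    {"name": "page_view", "properties": {"country": "USA"}},
    {"name": "checkout", "properties": {}},                    # 2nd: USA
    {"name": "checkout", "properties": {}},                    # 3rd: USA
    {"name": "page_view", "properties": {"country": "Mexico"}},
    {"name": "checkout", "properties": {}},                    # 4th: Mexico not tracked
    {"name": "page_view", "properties": {"country": "Canada"}},
    {"name": "checkout", "properties": {}},                    # 5th: Canada
]
print(attribute_checkouts(events, {"USA", "Canada"}))
# [None, 'USA', 'USA', None, 'Canada']
```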