We take statistical rigor seriously. The following are some highlights of how the platform supports conducting online digital experiments.
A frequentist approach
Why frequentist vs Bayesian?
Bayesian analysis requires a well-formed prior. That prior must either come from users, who are often not trained to form one, or from an automated approach, which tends to be biased toward no change and can yield slower results than a frequentist approach. Another advantage of the frequentist approach is that we can share our methods and data clearly with customers so they can follow and replicate the work, which is not as easy with Bayesian analysis (especially at scale). It also lets us leverage improvements pioneered by industry leaders in product experimentation at companies like LinkedIn and Microsoft, who also follow frequentist approaches.
2-tailed t-tests for experimentation
Using 2-tailed t-tests allows you to detect significance in both directions (positive and negative). This test allows Split to calculate the impact and compute a p-value. Unlike Bayesian approaches, where you will always get an answer, our platform will inform you if more data is needed to arrive at a statistically significant impact. Read more about how we test for Type I and Type II errors in our Documentation.
We use Welch's t-test, also known as the “unequal variances t-test.” Unlike the traditional Student's t-test, Welch's test does not assume that the variances of the two samples are equal, which makes the results more accurate when your samples have unequal variances or unequal sample sizes. When the sample variances are equal, the two tests give the same results.
You may notice a difference between what you see in Split and the outputs of external calculators, as those tools often use the traditional Student's t-test, which assumes equal variances and is therefore not appropriate for most real-world datasets.
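For illustration, here is a minimal sketch of the kind of two-tailed Welch's t-test described above, using SciPy. The per-user metric values are hypothetical, and this is not Split's internal implementation.

```python
# A minimal sketch of a two-tailed Welch's t-test (unequal variances).
# The per-user metric values below are hypothetical.
import numpy as np
from scipy import stats

control = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.3, 9.5, 10.7])
treatment = np.array([11.8, 12.4, 10.9, 13.2, 12.0, 11.7, 12.9, 13.5])

# equal_var=False selects Welch's t-test; the returned p-value is two-tailed.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

impact = treatment.mean() - control.mean()
print(f"impact: {impact:.2f}, t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
```

Passing equal_var=True instead would reproduce the traditional Student's t-test, which is one way to see why external calculators can report slightly different results.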
Configurable significance and power thresholds
The significance threshold is a representation of your organization's risk tolerance. Formally, the significance threshold is the probability of a false positive.
A commonly used value for the significance threshold is 0.05 (5%), which means that when there is no real difference between the performance of the treatments, there is a 5% chance of observing a statistically significant impact (a false positive). In statistical terms, the significance threshold is equivalent to alpha (α).
Power measures an experiment's ability to detect an effect when one exists. Formally, the power of an experiment is the probability of rejecting the null hypothesis when it is false.
A commonly used value for statistical power is 80%, which means that the metric has an 80% chance of reaching significance if the true impact is equal to the minimum likely detectable effect. Assuming all else is equal, a higher power will increase the recommended sample size needed for your split. In statistical terms, the power threshold is equivalent to 1 - β.
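As an illustrative sketch of how these thresholds interact with sample size (not Split's internal calculation), the standard two-sample formula below shows that raising the power or lowering the significance threshold increases the required sample size. The metric standard deviation and minimum detectable effect used here are hypothetical.

```python
# A minimal sketch of a per-treatment sample size estimate for a two-tailed
# test, given a significance threshold (alpha), power (1 - beta), the metric's
# standard deviation, and a minimum detectable effect. Values are hypothetical.
from scipy import stats

def sample_size_per_treatment(alpha, power, std_dev, min_detectable_effect):
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-tailed critical value
    z_power = stats.norm.ppf(power)           # critical value for the target power
    return 2 * ((z_alpha + z_power) * std_dev / min_detectable_effect) ** 2

n = sample_size_per_treatment(alpha=0.05, power=0.80,
                              std_dev=4.0, min_detectable_effect=0.5)
print(f"Recommended sample size per treatment: {n:.0f}")  # ~1005
```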
Guardrail checks
Review Period Check - This is a configuration within Split where you can specify how long a test should run to account for the seasonality which may exist in your data.
Sample Ratio Mismatch Check - This check detects sampling bias in your randomization by verifying that the observed distribution of users across treatments matches the targeting rules within a reasonable confidence level.
When conducting its sample ratio check, Split compares the calculated p-value against a threshold of 0.001. This threshold was chosen based on the constant and rigorous monitoring we perform on the accuracy of our randomization algorithms, and to minimize the impact a false positive would have on trust in experimental results.
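For illustration only, a sample ratio mismatch check of this kind can be expressed as a chi-square goodness-of-fit test of the observed assignment counts against the targeting percentages, flagged when the p-value falls below 0.001. The counts below are hypothetical and Split's implementation may differ.

```python
# A minimal sketch of a sample ratio mismatch (SRM) check using a chi-square
# goodness-of-fit test. Observed counts and targeting ratios are hypothetical.
from scipy import stats

observed = [50_700, 49_300]        # users actually assigned to each treatment
target_ratios = [0.5, 0.5]         # split defined by the targeting rules
expected = [r * sum(observed) for r in target_ratios]

chi2, p_value = stats.chisquare(observed, f_exp=expected)

# Flag a mismatch only when the p-value is below the 0.001 threshold, which
# keeps false alarms rare and preserves trust in experimental results.
if p_value < 0.001:
    print(f"Possible sample ratio mismatch (p = {p_value:.2e})")
else:
    print(f"No mismatch detected (p = {p_value:.4f})")
```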
Attribution and Exclusion
Split utilizes a well-documented and tested attribution and exclusion algorithm. This has the following benefits:
- You can ingest data from any data source for evaluation. By using your data you can always be confident in its integrity and accuracy.
- You can send in data to Split after an experiment is already running. Oftentimes, you might have already tracked some type of user action (e.g. clicks on a navigation bar) but might not have fed that data into Split ahead of running an experiment. Attribution is not based on the time an event is sent to us; the timestamp of each event you send reflects when the data was logged. This allows you to send data after events have already occurred and attribute them to experiments by matching timeframes against the time your application logged that data (see the sketch after this list).
- You can define a metric in Split after an experiment is already running. Similar to the scenario above, you might have data that you tracked during an experiment but haven't yet defined a metric for in Split. As long as Split has the events tied to a metric, you can define that metric at any time, even after you've started running the test. On the next run of the calculation job, the system will calculate the impact of your experiment on the new metric from when the experiment began, regardless of when you defined the metric.
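To make the timestamp-based attribution described in the list above concrete, here is a hypothetical sketch of matching independently tracked events to an experiment's timeframe. The field names and data structures are assumptions for illustration, not Split's actual schema.

```python
# A hypothetical sketch of attributing events to an experiment by the time
# each event was logged (its timestamp), not the time it was sent to Split.
# Field names and structures are illustrative assumptions.
from datetime import datetime, timezone

experiment_start = datetime(2024, 3, 1, tzinfo=timezone.utc)
experiment_end = datetime(2024, 3, 15, tzinfo=timezone.utc)

# Events tracked by the application; they can be sent to Split at any time,
# even after the experiment has started.
events = [
    {"key": "user_1", "event_type": "nav.click",
     "timestamp": datetime(2024, 3, 2, tzinfo=timezone.utc)},
    {"key": "user_2", "event_type": "nav.click",
     "timestamp": datetime(2024, 2, 20, tzinfo=timezone.utc)},
]

# Only events whose logged time falls inside the experiment window are
# attributed to the experiment.
attributed = [e for e in events
              if experiment_start <= e["timestamp"] <= experiment_end]
print(f"{len(attributed)} of {len(events)} events attributed")
```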
Minimum Sample Size
A common question: how did we arrive at the minimum sample size of 355?
Split based its minimum sample size on a Microsoft paper, and this value has become a general rule of thumb. From Split's documentation on the minimum sample size:
Normal Distributions
Robust experiments rely on the means of treatment and control groups, which are assumed to be normally distributed. The central limit theorem (CLT) shows that the mean of a variable has an approximately normal distribution if the sample size is large enough. We apply the rule of thumb that the minimum number of independent and identically distributed observations needed to safely assume that the means have a normal distribution is 355 for each treatment. Hence, we require a sample size of at least 355 in each treatment before we calculate significance for your metrics.
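As an illustrative sketch (not part of Split's product), the simulation below shows how the central limit theorem applies: means computed over 355 observations drawn from a heavily skewed distribution are approximately normal. The choice of an exponential distribution and its parameters is a hypothetical example.

```python
# A minimal sketch illustrating the central limit theorem: means of a skewed
# (exponential) metric, each computed over 355 observations, are approximately
# normally distributed even though the raw observations are not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_size = 355        # minimum observations per treatment
num_groups = 1_000       # number of simulated treatment groups

samples = rng.exponential(scale=2.0, size=(num_groups, sample_size))
means = samples.mean(axis=1)

# The raw exponential data has skewness near 2; the distribution of the means
# is far closer to symmetric, so a normality test typically does not reject it.
print(f"skewness of raw observations: {stats.skew(samples.ravel()):.2f}")
print(f"skewness of the means:        {stats.skew(means):.2f}")
print(f"normality test on the means:  p = {stats.normaltest(means).pvalue:.3f}")
```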