Why run an A/A test?
Everyone understands the value of A/B testing an application, where you compare the relative effects of one or more treatments on user behavior. But why on earth would you want to run an A/A test, in which you compare the behavior of two randomly selected groups of users that both receive the same treatment? The primary reason is to validate your experimental setup. If an A/A test shows consistent, statistically significant differences between the behaviors of two groups exposed to the same treatment, there is probably something wrong with your targeting or telemetry that should be addressed before you trust the results of any subsequent A/B tests.
A second benefit of an active A/A test is to collect baselines for all your metrics for a particular traffic type. Since the A/A test presumably applies to all users, averaging the metric values of the two groups gives you metric baselines across your entire application or site.
Prerequisites for running an A/A test
- Events supporting desired metrics are being sent to Split
- Application incorporates the Split SDK
- Desired metrics are created. This is not an absolute requirement, as metrics can be created at any time during an experiment and will be calculated using all appropriate, available events. But in most cases, you will at least want to have some set of metrics created.
Running the A/A test
Running an A/A test is simple. First, decide which traffic type you wish to test. A best practice is to run an A/A test for every traffic type for which you run experiments. Create the feature flag with a default rule specifying a percentage-based rollout: for an A/A test, 50/50; for an A/A/A test, 33/33/34.
The names of the treatments are inconsequential, so using the defaults of on and off is fine.
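Under the hood, percentage-based rollouts deterministically hash each key into a bucket, so a given user always receives the same treatment. The sketch below illustrates the idea only; it is not Split's actual bucketing algorithm, and the flag name aa_test matches the example here:

```python
import hashlib

def assign_treatment(user_key: str, flag_name: str = "aa_test") -> str:
    """Deterministically bucket a user into one of two identical treatments.

    Simplified illustration of a percentage-based rollout: hash the key,
    map it to a bucket in [0, 100), and split at the 50% mark. Split's
    real algorithm differs, but the principle is the same.
    """
    digest = hashlib.md5(f"{flag_name}:{user_key}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "on" if bucket < 50 else "off"

# The same key always lands in the same bucket, so assignment is stable
# across sessions and devices that share the key.
assert assign_treatment("user_123") == assign_treatment("user_123")
```

Because the assignment is a pure function of the key, the two groups differ only in which bucket the hash landed in, which is exactly the randomization an A/A test is meant to validate.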
Once the feature flag is created (named, for example, aa_test), you need only add a getTreatment call for that feature flag, passing the appropriate key for the flag's traffic type. The placement of that call depends somewhat on the traffic type. For an anonymous traffic type, the call should be made on the visitor's first contact with the application or site. For a known user traffic type, the getTreatment call should be made on login, or as soon as the user can be identified. If you are using one of the client-side SDKs, you can add the getTreatment call right after you initialize the SDK for the user. Just adding the call is sufficient; there is no need to store the returned value, because the goal is for the returned treatment to have zero effect on the user's experience. An impression is still generated and sent to Split, so the analytics will know which treatment was assigned to each user.
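As a sketch of the wiring, here is what the fire-and-forget call might look like with the Split Python SDK (method names vary by SDK; for example, the JavaScript SDK uses getTreatment, while Python uses get_treatment; the SDK key, user ID, and timeout below are placeholders):

```python
from splitio import get_factory
from splitio.exceptions import TimeoutException

# Placeholder credentials and key; substitute your own values.
factory = get_factory("YOUR_SDK_KEY")
try:
    factory.block_until_ready(5)  # wait up to 5 seconds for the SDK to be ready
except TimeoutException:
    pass  # the SDK will return "control" until it is ready

client = factory.client()

# Return value intentionally ignored: the treatment must not change the
# user's experience. The call still generates an impression for Split.
client.get_treatment("user_123", "aa_test")
```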
Analyzing the results
While seasonality would not be expected to affect the results of an A/A test, one of the goals here is to identify unexpected outcomes, so a recommended best practice would be to run the test for at least a week before initially assessing the outcome. At the end of a week, what are you hoping to see on your metric cards?
Because, ideally, there is no difference between the on and off treatment groups, you would expect to see only minor differences between the treatments and no conclusions of statistical significance. Something like this:
Does this mean that if you do see a statistically significant metric, something went wrong and your telemetry is suspect? No. It's not intuitively obvious, but when there is no true difference between the groups, the p-value calculated for any particular metric comparison in an A/A test is equally likely to take any value between zero and one. So for any given metric, there is a 5% chance of a p-value of 0.05 or less (the default threshold for statistical significance in Split). Thus, if you are looking at 20 or more metrics, there's a good chance that at least one of them will have such a p-value and be classified as statistically significant.
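The arithmetic behind that "good chance" is easy to check: if p-values are uniform under the null and the metrics are treated as independent, the probability that at least one of k metrics crosses the 0.05 threshold is 1 - 0.95^k.

```python
def family_wise_false_positive(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent metrics shows a
    p-value at or below alpha when there is no true difference."""
    return 1 - (1 - alpha) ** k

print(round(family_wise_false_positive(1), 3))   # single metric: 0.05
print(round(family_wise_false_positive(20), 3))  # 20 metrics: roughly 0.64
```

In other words, with 20 metrics on the dashboard, a lone "significant" result in an A/A test is more likely than not, which is why one stray metric is not, by itself, evidence of a broken setup.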
Here is a chart of p-values calculated for 8 metrics at 9 different times over the course of an A/A test.
Remember that checking the p-value multiple times increases the chance of seeing a p-value within the range of statistical significance. Note that over the life of the A/A test, two different metrics show p-values of 0.05 or less. Overall, three of the 72 data points (8 metrics at 9 observation times) are at or below 0.05. This is within the realm of expectation. But if three out of twenty metrics were statistically significant at the same observation point, you might want to investigate further. Were those three metrics based on the same event? If so, this might not be unexpected.
With metrics based on distinct events, you would expect an A/A test to show a random distribution of p-values at any given observation point. If you don't see that, you'll need to do some further analysis to find out why bias is creeping into your results. For instance, was there a Sample Ratio Mismatch (SRM) that might have skewed the results?
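One concrete way to check for an SRM in a 50/50 rollout is a chi-square goodness-of-fit test on the assignment counts. A minimal sketch (the 3.841 cutoff is the standard chi-square critical value for one degree of freedom at alpha = 0.05, and the sample counts below are made up):

```python
def srm_check(count_on: int, count_off: int, critical: float = 3.841) -> bool:
    """Chi-square goodness-of-fit test for a 50/50 split (df = 1).

    Returns True if a sample ratio mismatch is detected at alpha = 0.05,
    i.e., the observed counts deviate from 50/50 more than chance allows.
    """
    expected = (count_on + count_off) / 2
    chi_sq = ((count_on - expected) ** 2 + (count_off - expected) ** 2) / expected
    return chi_sq > critical

print(srm_check(5020, 4980))  # small imbalance: consistent with 50/50
print(srm_check(5300, 4700))  # large imbalance: likely SRM, investigate
```

If the check flags an SRM, the randomization or the impression pipeline is suspect, and metric comparisons from that test should not be trusted until the cause is found.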
Once you’ve added an A/A test to your site or application, it’s a good idea to leave it running to ensure that bias or telemetry issues don’t creep in somehow. For instance, if you add new events and metrics to the system, checking them in your A/A test can give you confidence that all is well.
As mentioned above, it’s a good practice to have a separate A/A test for every traffic type being used for experimentation, since randomization, event reporting, and available metrics differ for each.
Finally, if you are deploying an experiment targeting a specific population, it couldn’t hurt to add an A/A test with that same specific targeting rule prior to deploying your A/B test, just to make sure that randomization and telemetry are working properly for that particular population.