Reviewing metrics
With Split, it is easy to check the change in a metric at any time, but to determine whether an observed change represents a meaningful difference in the underlying populations, you first need to collect a sufficient amount of data. If you look for significance too early or too often, you are guaranteed to find it eventually.
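To see why, consider this minimal simulation (independent of Split, with illustrative numbers): both variants draw from the same distribution, so there is no real effect, yet repeatedly checking for p < 0.05 as data accumulates declares a "significant" difference far more often than the nominal 5% error rate.

```python
# Simulation of "peeking": both groups come from the same distribution,
# so there is no real effect, yet checking for p < 0.05 after every batch
# of users eventually "finds" significance far more than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 1_000
batches, batch_size = 20, 100          # peek after every 100 users per variant
false_positives = 0

for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, batches * batch_size)
    treatment = rng.normal(0.0, 1.0, batches * batch_size)   # no true difference
    for i in range(1, batches + 1):
        n = i * batch_size
        _, p = stats.ttest_ind(control[:n], treatment[:n])
        if p < 0.05:                    # "significant" at this peek
            false_positives += 1
            break

print(f"Declared an effect in {false_positives / n_experiments:.0%} of experiments")
# Typically well above the nominal 5% false positive rate.
```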
The precision with which changes can be detected is likewise tied directly to the amount of data collected, so evaluating an experiment's results on a small sample introduces the risk of missing a meaningful change. The target sample size at which you should evaluate the experiment's results depends on what size of effect is meaningful (the minimum detectable effect), the variance of the underlying data, and how likely you need to be to detect that effect when it is actually present (the power).
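As a rough sketch of how those inputs translate into a target sample size (the baseline rate, minimum detectable effect, and power below are illustrative assumptions, not Split defaults), the standard two-sample calculation for a difference in proportions looks like this:

```python
# Sketch of a sample size calculation for a conversion-rate metric,
# using statsmodels' normal-approximation power solver.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10          # assumed current conversion rate
mde = 0.01                    # minimum detectable effect: +1 percentage point
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,               # significance level
    power=0.8,                # 80% chance of detecting the effect when it exists
    ratio=1.0,                # equal-size treatment and control
)
print(f"~{n_per_variant:,.0f} users needed per variant")
```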
The Experimental review period feature is intended to help you avoid reaching conclusions before the experiment has run long enough. You can find more information on review periods, and on when metric cards are updated, in this article.
Changing an experiment
During the course of an experiment, the metrics and review period will reset any time you make a change to the feature flag. This ensures that you are evaluating your metrics against a consistent distribution of the population: when that distribution changes, your experiment resets.
If you change a metric during a running experiment, the metric card shows a message saying that we have no data for that card. The next time calculations are made for that experiment (the frequency depends on the age of the experiment), the card is updated to reflect the new metric definition. Versions of the experiment that have already been completed are not recalculated.
Ramp plans
You will want to take these considerations into account when you develop your ramp plan. For percentage-based rollouts, it is recommended that you start with a debugging phase aimed at reducing the risk of obvious bugs or a bad user experience. The goal of this phase is not to make a decision but to limit risk, so there is no need to wait for statistical significance at this stage. Ideally, a few quick ramps (to 1%, 5%, or 10% of users), each lasting a day, should be sufficient for debugging.
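As an illustration of why these small percentage ramps are low-risk and stable, here is a hypothetical sketch of how a percentage-based rollout is commonly implemented (this is not Split's exact bucketing algorithm): the user key is hashed to a stable bucket, so the same users stay exposed as the percentage grows.

```python
# Hypothetical percentage-rollout bucketing: hash the user key to a stable
# bucket in [0, 100) and compare it with the current rollout percentage.
import hashlib

def bucket(user_key: str, flag_name: str) -> int:
    """Deterministically map a user to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_key}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_key: str, flag_name: str, percentage: int) -> bool:
    return bucket(user_key, flag_name) < percentage

# Ramping: users included at 1% remain included as the percentage grows.
for pct in (1, 5, 10):
    exposed = sum(in_rollout(f"user-{i}", "new-checkout", pct) for i in range(10_000))
    print(f"{pct}% rollout exposes ~{exposed} of 10,000 users")
```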
During the maximum power ramp (MPR) phase, you will in most cases want to hold your experiment for at least a week. This phase gives the most statistical power to detect differences between treatment and control. For a two-variant experiment (treatment and control), a 50/50 distribution of all users is the MPR; for a three-variant experiment (two treatments and control), it is a 33/33/34 distribution. Spending at least a week on this step of the ramp lets you collect enough data on the treatment's impact.
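The numbers below are illustrative (an assumed metric standard deviation and total traffic, not Split output), but they show why an even split maximizes power: the standard error of the difference between variants is smallest when the available traffic is divided equally.

```python
# Standard error of the difference in means, sqrt(s^2/n_t + s^2/n_c),
# for different treatment shares of a fixed amount of traffic.
import math

total_users = 100_000
std_dev = 1.0  # assumed metric standard deviation, same in both variants

for treatment_share in (0.10, 0.25, 0.50):
    n_t = total_users * treatment_share
    n_c = total_users - n_t
    se = math.sqrt(std_dev**2 / n_t + std_dev**2 / n_c)
    print(f"{treatment_share:.0%} treatment: standard error of difference = {se:.5f}")
# The 50% allocation gives the smallest standard error, i.e. the most power.
```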
If the targeting is more complex, you may want to use Traffic allocation as a way of moving from risk mitigation to MPR. This could avoid the need to make small discrete changes to the targeting rules.
You may have further phases to test scalability, or perhaps to hold out a small percentage of users to understand the long-term impact.
Since any change to the experiment will trigger a reset, one best practice is to create a segment for the individual targets of each treatment. This allows you to add and remove users from those targets without modifying the experiment itself.