Split imposes a strict policy: any change to a feature flag's definition creates a new version and restarts metrics calculation with a clean slate for that version. For some changes, such as adding a completely new targeting rule, it is clear why that is a good policy. But one area where it may not be immediately obvious why, statistically speaking, this is the right approach is a gradual rollout, where you start with, say, a 10/90 distribution and then move to 50/50, either in one fell swoop or via a series of incremental changes. This article discusses the problems that can arise from combining data from both phases and analyzing it as one set.
There are a couple of different ways we could go about analyzing data from the two different definitions.
Assuming the 10/90 and the 50/50 versions ran for a week each, so that seasonality was not a concern, Split could:
- Treat the two versions as separate sets of users, so users who were in both versions would be double counted. This is a problem because loyal or frequently returning users are more likely to appear in both versions, so they would have a greater impact on the results than they should. Double counting the same user also violates the assumption that observations are independent.
- Treat it as a single test, so the same user key is counted as one individual user, and exclude from the calculations users who switch treatments (as Split currently does). In this case, frequently returning users are more likely to be excluded, biasing the results towards non-returning users. This problem is more pronounced if the treatment impacts how likely a user is to return.
- Treat it as a single test, but rather than excluding users who switch treatments, attribute each event to the last treatment the user saw prior to the event. This is a problem because some users have more time in one treatment than the other. If you look at a metric like the fraction of users who convert, you would often expect that metric to naturally increase over time unless the event always happens very soon after the first visit. In the 10% on / 90% off -> 50/50 example (assuming for simplicity that everyone returns), 10% of users have 2 weeks in on, 40% have 1 week in off followed by 1 week in on, and 50% have 2 weeks in off. Because more users spend the full 2 weeks in off than in on, the users seeing off have longer on average to convert. This effect could be large enough that the off treatment would look better even in an A/A test.
- Treat it as a single test, do not exclude users who switch treatments, and attribute events to the final treatment the user saw. This raises a host of issues: you don't really know which treatment caused the event, and you cannot rely on the results as an accurate indicator of what you would see if you put that change into production.
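To make the last-treatment-attribution bias concrete, here is a small illustrative sketch (not Split's implementation) that works through the 10/90 -> 50/50 example above as an exact expected-value calculation. It assumes an A/A setting where both treatments behave identically: every user is present in both weeks, a not-yet-converted user converts with an assumed per-week probability of 0.2, and each user is attributed to the treatment that was active when they converted (or to their final treatment if they never convert).

```python
# Assumed per-week conversion probability, identical for on and off (A/A).
P_WEEK = 0.2

# Cohorts after the ramp: (population share, week-1 treatment, week-2 treatment)
cohorts = [
    (0.10, "on", "on"),    # in on during 10/90, stays in on at 50/50
    (0.40, "off", "on"),   # switches from off to on at the ramp
    (0.50, "off", "off"),  # in off for both weeks
]

converted = {"on": 0.0, "off": 0.0}   # expected share of users converting in each treatment
attributed = {"on": 0.0, "off": 0.0}  # expected share of users attributed to each treatment

for share, week1, week2 in cohorts:
    p_wk1 = P_WEEK                  # converts in week 1
    p_wk2 = (1 - P_WEEK) * P_WEEK   # survives week 1, converts in week 2
    p_never = (1 - P_WEEK) ** 2     # never converts
    converted[week1] += share * p_wk1
    converted[week2] += share * p_wk2
    attributed[week1] += share * p_wk1             # attributed where they converted
    attributed[week2] += share * (p_wk2 + p_never) # converted in week 2, or never

for t in ("on", "off"):
    print(f"{t}: conversion rate = {converted[t] / attributed[t]:.3f}")
# on: conversion rate = 0.238
# off: conversion rate = 0.448
```

Even though the treatments are identical, off appears to convert far better, because the off group keeps the early converters among the switchers while the on group inherits their non-converters.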
If there are no returning users, and the events on which metrics are based are all expected to happen very soon after the user visits, there might not be a problem; if the experimental unit is a session, for example, these issues may not apply. In nearly all real-world cases, though, the issues above would be present in one way or another.
So rather than take the risk of calculating results that have a good chance of misleading you, Split takes the high statistical road and resets the metrics so each version of a feature flag begins with a clean slate.