This article includes best practices for running a Split Synchronizer.
By default, Split’s SDKs keep segment and feature flag data synchronized as users navigate across disparate systems, treatments and conditions. Some languages, however, do not have a native capability to keep a shared local cache of this data to properly serve treatments.
In its default mode, the Split Synchronizer coordinates the sending and receiving of data to a remote datastore (Redis) that all of your processes can share to pull data for the evaluation of treatments, acting as the cache for your SDKs. It will also post impression data and metrics generated by the SDKs back to Split’s servers, for exposure in the web console or sending to the data integration of your choice.
Optionally, you can configure the Split Synchronizer in proxy mode. Using proxy mode, you can reduce connection latencies from the SDKs to Split server in a way that is completely transparent to the SDKs.
More information on configuring the Synchronizer can be found here.
The Synchronizer runs as a standalone process on a dedicated or shared server; it doesn't affect the performance of your code or Split's SDKs. The following are best practices for running the Synchronizer.
SDKs
Alerting on CONTROL treatment: Although CONTROL is a known treatment, an application suddenly starting to report the CONTROL treatment is usually a sign that something is wrong when evaluating feature flags. We recommend setting up an impressions listener that emits StatsD metrics for the CONTROL treatment. The impressions listener for Split Sync is documented under Listener.
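As an illustration, here is a minimal sketch in Go of a listener endpoint that counts CONTROL treatments via StatsD. The endpoint path, port, StatsD address, and the simplified payload shape are assumptions; consult the Listener documentation for the exact schema your Synchronizer version posts.

```go
// Minimal sketch of an HTTP endpoint that receives impressions from the
// Split Synchronizer impression listener and emits a StatsD counter for
// every CONTROL treatment. Payload shape and addresses are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net"
	"net/http"
	"strings"
)

// Assumed (simplified) shape of the payload posted by the impression listener.
type impressionsPayload struct {
	Impressions []struct {
		TestName       string `json:"testName"`
		KeyImpressions []struct {
			KeyName   string `json:"keyName"`
			Treatment string `json:"treatment"`
		} `json:"keyImpressions"`
	} `json:"impressions"`
}

func main() {
	// Plain UDP socket to a local StatsD agent (address is an assumption).
	statsd, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/impressions", func(w http.ResponseWriter, r *http.Request) {
		var payload impressionsPayload
		if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, imp := range payload.Impressions {
			for _, ki := range imp.KeyImpressions {
				if strings.EqualFold(ki.Treatment, "control") {
					// StatsD counter line, e.g. "split.control.my_flag:1|c"
					fmt.Fprintf(statsd, "split.control.%s:1|c", imp.TestName)
				}
			}
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

You can then alert in your monitoring system whenever the split.control.* counters start rising.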
Logs: We recommend reporting and alerting on errors coming from the Split logs. Anything labeled error or exception in the Split logs should usually be of concern. One way to isolate Split logs is to direct them to a custom location, following the logging instructions for each SDK.
Split Sync
Split Sync is written in Go and is highly performant compared with JVM-based and interpreted languages. Below are relevant topics of interest when running Split Sync for a production workload.
Setup
We recommend running Split Sync under supervision, to make sure the process can be brought back up in the event of a crash.
Redis: Set up Split Sync in its own Redis database. We DO NOT recommend using database zero (the default), as it is usually used by other applications.
Hardware requirements
We've done extensive testing, but understand that each environment is different. The minimum requirements we tested, using the Amazon AWS cloud for a production workload, were:
Split Sync process: AWS EC2 m5.large/m5a.large, 2 vCPUs, 8GB RAM
Redis: AWS ElastiCache cache.m5.large, 2 vCPUs, 6GB RAM
Benchmark
The above setup was able to perform within safe limits for both CPU and memory on the Split Sync process, as well as Redis.
- Total number of feature flags (Splits) in the system: 200
- Impressions refresh rate (impressionsRefreshRate): 10 seconds
- Threads posting impressions (impressionsThreads): 2 (example value)
- Impressions per post (impressionsPerPost): 5,000 (example value)
The total was 120,000 posts per minute. The test concluded safely after posting 4 billion impressions to the Split backend. Both the Split Sync process and Redis remained within safe margins for CPU and memory at all times.
Alerts
We recommend the following alerts:
Split Sync process: Keep CPU under 50% utilization to avoid any performance degradation and to prevent the Split Sync process from falling behind.
Redis: Keep CPU under 50% utilization. Memory should remain under safe limits; 60 or 70% should be fine, but make sure to monitor the rate of growth. Running constantly at 70% utilization may be acceptable, but if memory grows by 10% every 5 minutes, that is likely a problem and a sign that Split Sync cannot keep up with evicting data (a minimal monitoring sketch follows the procedure below). If Redis memory continues to increase, try the following procedure:
- Stop Split Sync gracefully to avoid losing data.
- Increase the number of threads dedicated to posting impressions by two (2). Config key: impressionsThreads.
- Start Split Sync again.
- Repeat if memory consumption keeps increasing.
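To automate the growth-rate check described above, here is a minimal sketch using the go-redis client; the Redis address, database number, and the 10%-per-5-minutes threshold are assumptions to adapt to your deployment.

```go
// Minimal sketch that samples Redis used_memory every few minutes and warns
// when growth looks unbounded (a sign Split Sync is falling behind).
package main

import (
	"context"
	"log"
	"strconv"
	"strings"
	"time"

	"github.com/redis/go-redis/v9"
)

// usedMemory reads the used_memory value from the INFO memory section.
func usedMemory(ctx context.Context, rdb *redis.Client) int64 {
	info, err := rdb.Info(ctx, "memory").Result()
	if err != nil {
		log.Fatal(err)
	}
	for _, line := range strings.Split(info, "\r\n") {
		if strings.HasPrefix(line, "used_memory:") {
			n, _ := strconv.ParseInt(strings.TrimPrefix(line, "used_memory:"), 10, 64)
			return n
		}
	}
	return 0
}

func main() {
	ctx := context.Background()
	// Address and DB number are assumptions for this sketch.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379", DB: 1})
	prev := usedMemory(ctx, rdb)
	for range time.Tick(5 * time.Minute) {
		cur := usedMemory(ctx, rdb)
		if prev > 0 {
			growth := float64(cur-prev) / float64(prev)
			if growth > 0.10 { // >10% growth in 5 minutes: likely falling behind
				log.Printf("WARNING: Redis memory grew %.0f%% in the last 5 minutes", growth*100)
			}
		}
		prev = cur
	}
}
```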
Alerting on the CONTROL treatment can also be set up at the Split Synchronizer level by configuring an impression listener (see the Listener documentation). This approach is similar to the SDK-level approach described at the top of this article, but from the Synchronizer's standpoint.
Healthchecks: Use the endpoints exposed by Split Sync to monitor process health, primarily /admin/ping and /admin/healthcheck. This is usually required when running under container orchestrators such as AWS ECS, Mesos, or Kubernetes, and it is also useful for simple health check monitoring.
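For example, here is a minimal probe that an orchestrator or cron job could run against these endpoints; the admin host and port are assumptions, so use whatever address your Synchronizer exposes.

```go
// Minimal sketch of a health probe against the Split Sync admin endpoints.
// The base address below is an assumption for illustration.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	base := "http://localhost:3010" // assumed Split Sync admin address
	for _, path := range []string{"/admin/ping", "/admin/healthcheck"} {
		resp, err := http.Get(base + path)
		if err != nil {
			fmt.Fprintf(os.Stderr, "health check failed: %s: %v\n", path, err)
			os.Exit(1) // non-zero exit lets an orchestrator restart the task
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			fmt.Fprintf(os.Stderr, "health check failed: %s returned %d\n", path, resp.StatusCode)
			os.Exit(1)
		}
	}
	fmt.Println("Split Sync is healthy")
}
```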
Alerting on logs
If you wish to augment alerts with log-based alerts, consider the following:
During Redis errors, Split Sync will show lines such as:
connect: connection refused
For any other I/O errors you should see:
Error fetching segment
Error fetching splits
or, for a more generic error:
Error fetching
When you manually execute operations such as dropping or flushing Impressions or Events, an error is returned if another operation is running at the same time.
In Debug level, the following log will appear for flushing:
Cannot execute flush. Another operation is performing operations in Events.
and for Impressions
Cannot execute flush. Another operation is performing operations in Impressions.
In Debug level, the following log will appear for dropping:
Cannot execute drop. Another operation is performing operations in Events.
and for Impressions
Cannot execute drop. Another operation is performing operations in Impressions.
Additionally, the Synchronizer performs automatic eviction of Events and Impressions. Manual and automatic eviction will not be executed at the same time; if an eviction is already running, the process will skip the new operation. At Debug level, it logs the following message:
Another task is performing operation on Events. Skipping.
and for Impressions
Another task is performing operation on Impressions. Skipping.
Webhook
If you want to track messages via Slack, you can do so by adding the webhook URL and the Slack channel to the Split Synchronizer configuration.
How you start your Synchronizer will determine how you add this parameter:
JSON | CLI PARAMETER | DOCKER ENV | TYPE | DESCRIPTION |
---|---|---|---|---|
slackChannel | -log-slack-channel | SPLIT_SYNC_LOG_SLACK_CHANNEL | string | Sets the Slack channel or user to which a real-time summary of ERROR-level logs is reported. |
slackWebhookURL | -log-slack-webhook-url | SPLIT_SYNC_LOG_SLACK_WEBHOOK | string | Sets the Slack webhook URL to which a real-time summary of ERROR-level logs is reported. |
Note that with this webhook you can track error-level messages, as well as when the Split Synchronizer starts, when it is gracefully shut down, or when it is forced to stop.
Example: notifications for Sync started and graceful shutdown.
Fetch Queue Size of Impressions or Events
Since version 2.1.0, there are two APIs that will tell you how many Impressions or Events are stored in Redis.
Note: It's recommended that you fetch the queue values frequently. If the queue size or the impression count in the queue keeps growing, the Synchronizer cannot catch up with the incoming impressions or events. A minimal polling sketch is included after the endpoint descriptions below.
Events
- Method GET
- /api/events/queueSize
- Data out: JSON
- Example: GET /api/events/queueSize
{
  "queueSize": <QUEUESIZE>
}
Impressions
- Method GET
- /api/impressions/queueSize
- Data out: JSON
- Example: GET /api/impressions/queueSize
{
  "queueSize": <QUEUESIZE>
}
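Below is a minimal polling sketch for these two endpoints, as referenced in the note above. The admin address and polling interval are assumptions.

```go
// Minimal sketch that polls the queueSize endpoints and logs the values so
// you can alert on sustained growth. The admin address is an assumption.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type queueSizeResponse struct {
	QueueSize int64 `json:"queueSize"`
}

func queueSize(url string) int64 {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	var out queueSizeResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	return out.QueueSize
}

func main() {
	base := "http://localhost:3010" // assumed Split Sync admin address
	for range time.Tick(time.Minute) {
		log.Printf("impressions queue: %d, events queue: %d",
			queueSize(base+"/api/impressions/queueSize"),
			queueSize(base+"/api/events/queueSize"))
	}
}
```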
Manually flush Impressions or Events
Since version 2.1.0, if for some reason you wish to flush Impressions or Events, there are two APIs to do that. Flushing sends the stored information to Split servers in batches. A minimal sketch of calling these endpoints is included after the endpoint descriptions below.
For Events
- Method POST
- /api/events/flush
- Data in:
- Name: size
- Validation: 1..X
- Optional
- Query Parameter
- Example: POST /api/events/flush?size=200
- RESPONSE:
- 200: No Errors
- 400: Bad request
- 5XX: Internal Server Error
For Impressions
- Method POST
- /api/impressions/flush
- Data in:
- Name: size
- Validation: 1..X
- Optional
- Query Parameter
- Example: POST /api/impressions/flush?size=200
- RESPONSE:
- 200: No Errors
- 400: Bad request
- 5XX: Internal Server Error
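Here is a minimal sketch of triggering a manual flush through these endpoints; the admin address and the size of 200 are assumptions taken from the examples above.

```go
// Minimal sketch that triggers a manual flush of events and impressions via
// the Synchronizer admin API. Address and batch size are assumptions.
package main

import (
	"log"
	"net/http"
)

func main() {
	base := "http://localhost:3010" // assumed Split Sync admin address
	for _, path := range []string{
		"/api/events/flush?size=200",
		"/api/impressions/flush?size=200",
	} {
		resp, err := http.Post(base+path, "application/json", nil)
		if err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		log.Printf("POST %s -> %s", path, resp.Status)
	}
}
```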
Manually drop Impressions or Events
Since version 2.1.0, if for some reason you wish to drop Impressions or Events, there are two APIs to do that. Dropping removes the stored elements from Redis without sending them to Split servers. A minimal sketch is included after the endpoint descriptions below.
For Events
- Method POST
- /api/events/drop
- Data in:
- Name: size
- Validation: 1..X
- Optional (if size is not passed, it will drop all the elements)
- Query Parameter
- Example: POST /api/events/drop?size=200
- RESPONSE:
- 200: No Errors
- 400: Bad request
- 5XX: Internal Server Error
For Impressions
- Method POST
- /api/impressions/drop
- Data in:
- Name: size
- Validation: 1..X
- Optional
- Query Parameter
- Example: POST /api/impressions/drop?size=200
- RESPONSE:
- 200: No Errors
- 400: Bad request
- 5XX: Internal Server Error
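Similarly, here is a minimal sketch of a drop call. This one omits the size parameter, which (per the validation notes above) drops all stored elements, so use it with care; the admin address is an assumption.

```go
// Minimal sketch that drops ALL queued events from Redis (no size parameter
// means drop everything). Dropped data is not sent to Split.
package main

import (
	"log"
	"net/http"
)

func main() {
	base := "http://localhost:3010" // assumed Split Sync admin address
	resp, err := http.Post(base+"/api/events/drop", "application/json", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("POST /api/events/drop -> %s", resp.Status)
}
```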
How can I check if Split Synchronizer is configured properly?
Below is a set of config keys that can be used to tune the Synchronizer if your workload demands a custom configuration. The defaults are intended to keep memory consumption constant, even at high load, but not all workloads are alike. This guide contains instructions for measuring the capacity of the Synchronizer process and fine-tuning it if the default settings are not sufficient.
- Fp: post rate in seconds, determined by ImpressionsPostRate and EventsPostRate
- Ap: amount of Impressions or Events per post, determined by ImpressionsPerPost and EventsPerPost
- T#: number of threads used for sending data
- T(h) = (3,600 / Fp) * Ap * T#: the total amount of Impressions or Events flushed per hour
- Es: Events generated per second
- Eh = Es x 3,600: Events generated per hour
- Is: Impressions generated per second
- Ih = Is x 3,600: Impressions generated per hour
- Xh: depending on what you are analyzing, use either Ih (Impressions) or Eh (Events)

Using the definitions above, calculate lambda (λ) and interpret its value according to the description below:

λ = T(h) / Xh
If λ >= 1: the current configuration is processing Events or Impressions without accumulating elements in the queue. In other words, the eviction rate >= the generation rate, and the Split Synchronizer is able to flush data as it arrives from the SDKs.
If λ < 1: the current configuration may not be enough to process all the incoming data, and over time it may produce an ever-increasing memory footprint. Recommendation: increase the number of threads or reduce the post rate so that elements are evicted more often. We recommend increasing the number of threads first if they are still at the default value of 1, without exceeding the number of cores. When reducing the post rate (the flush interval), decrease the value conservatively, by ten or twenty percent at a time.
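As a quick sanity check, the following sketch computes T(h) and λ from these parameters; the numbers in main() reproduce Example 1 below.

```go
// Minimal sketch of the capacity check described above: computes T(h) and
// lambda from the post rate, batch size, thread count, and generation rate.
package main

import "fmt"

// lambda returns T(h) / Xh given:
//   fp        - post rate in seconds (ImpressionsPostRate / EventsPostRate)
//   ap        - elements per post (ImpressionsPerPost / EventsPerPost)
//   threads   - number of sender threads (T#)
//   perSecond - elements generated per second (Is or Es)
func lambda(fp, ap, threads, perSecond float64) float64 {
	th := (3600.0 / fp) * ap * threads // elements flushed per hour, T(h)
	xh := perSecond * 3600.0           // elements generated per hour, Xh
	return th / xh
}

func main() {
	l := lambda(60, 1000, 1, 3)      // Example 1: Fp=60s, Ap=1,000, T#=1, Is=3
	fmt.Printf("lambda = %.2f\n", l) // prints 5.56: eviction keeps up with generation
	if l < 1 {
		fmt.Println("increase threads or lower the post rate")
	}
}
```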
Before going to the examples, if you want to edit these parameters on the Synchronizer, the table below shows the equivalences. For more information, visit the Advanced Configuration section of the Split Synchronizer docs.
Example 1
We will calculate the performance of Impressions considering the following configuration scenario:
- Impressions Post Rate (Fp) = 60 Seconds
- Impressions Per Post (Ap) = 1,000
- Impressions generated per second (Is) = 3
- Impressions generated in one hour (Ih) = 3*3,600 = 10,800
- Number of threads (T#) = 1
Let's do some math. The total amount of impressions sent per hour is driven by:
T(h) = (3,600 seconds in one hour / Fp) * Ap * T# = (3,600/60) * 1,000 * 1 = 60,000
with Xh = Ih = 10,800 (impressions generated per hour).
Our λ factor is then:
λ = T(h) / Xh = 60,000 / 10,800 ≈ 5.56
- λ is higher than 1: The configuration above will absorb some peaks and send all the Impressions.
Example 2
Now let’s consider a higher number of Impressions per hour with the same configuration as before:
- Impressions Post Rate (Fp) = 60 Seconds
- Impressions Per Post (Ap) = 1,000
- Impressions generated per second (Is) = 30
- Impressions generated in one hour (Ih)= 30*3,600 = 108,000
- Number of threads (T#) = 1
Let's do more math:
- T(h) = (3,600/60) * 1000 * 1 = 60,000
- Xh = Ih = 108,000
- λ = T(h) / Xh = 60,000 / 108,000 ≈ 0.56
- λ is less than 1: The configuration above is not enough to flush all the Impressions; it would need more than one hour to evict and send all the elements to the Split servers, while the Synchronizer keeps receiving new elements during the next hour. The corrective action is to increase the number of threads if they are still at the default of one, and then decrease the post rate as indicated in the previous section.
- T# = 2 or Fp = 30 (either will be good enough)
Note: Eviction can also be executed manually, but keep in mind that a manual flush sends 5 batches of 5,000 elements (25,000 in total per call). In this case, 2 calls to the manual eviction endpoint would be needed to evict the 48,000 pending Impressions.
Example 3
Let's try with a higher number of Impressions generated per hour:
- Impressions Post Rate (Fp) = 60 Seconds
- Impressions Per Post (Ap) = 1,000
- Impressions generated per second (Is) = 300
- Impressions generated in one hour (Ih)= 300*3,600 = 1,080,000
- Number of threads (T#) = 2
Let's do more math:
- T(h) = (3,600/60) * 1000 * 2 = 120,000
- Xh = Ih = 1,080,000
- λ = T(h) / Xh = 120,000 / 1,080,000 ≈ 0.11
- λ is less than 1: This indicates that the process is not adequately provisioned to send all generated impressions. Even if this is only a peak, it would take more than 8 hours to send all of the Impressions to the Split servers (120,000 impressions are sent per hour), even when using more than one thread. Manual eviction is not the solution either; in this case, editing the configuration is the better approach.
T# = 5 and Fp = 15 is one example of a configuration that keeps the generation rate below the eviction rate.
Manually kill a Split when using Redis
This method overrides the content of the Split definition in Redis.
Disclaimer
This method is not recommended and should be used only as a last resort in the rare circumstance of a Split service outage where you need to kill the feature flag at that precise moment. If you execute these steps during an outage, you are responsible for reverting the change: the Split Synchronizer cannot revert the feature flag on its own unless you go to the Split user interface and make a change to the same feature flag where you placed the override.
First, locate the feature flag definition in Redis by connecting to the Redis console via the redis-cli. For this example, we'll use the flag featureFlagPerformanceMonitor.
get SPLITIO.split.<YOUR_SPLIT_NAME>
For example:
redis> get SPLITIO.split.featureFlagPerformanceMonitor "{\"trafficTypeName\":\"organization\",\"name\":\"featureFlagPerformanceMonitor\",\"trafficAllocation\":100,\"trafficAllocationSeed\":-1385870765,\"seed\":-1064709934,\"status\":\"ACTIVE\",\"killed\":false,\"defaultTreatment\":\"off\",\"changeNumber\":1530047266117,\"algo\":2,\"conditions\":[{\"conditionType\":\"ROLLOUT\",\"matcherGroup\":{\"combiner\":\"AND\",\"matchers\":[{\"keySelector\":{\"trafficType\":\"organization\",\"attribute\":null},\"matcherType\":\"ALL_KEYS\",\"negate\":false,\"userDefinedSegmentMatcherData\":null,\"whitelistMatcherData\":null,\"unaryNumericMatcherData\":null,\"betweenMatcherData\":null,\"booleanMatcherData\":null,\"dependencyMatcherData\":null,\"stringMatcherData\":null}]},\"partitions\":[{\"treatment\":\"on\",\"size\":100},{\"treatment\":\"off\",\"size\":0}],\"label\":\"default rule\"}]}"
Find the section containing the "killed\":false portion and replace it with "killed\":true:
"{\"trafficTypeName\":\"organization\",\"name\":\"featureFlagPerformanceMonitor\",\"trafficAllocation\":100,\"trafficAllocationSeed\":-1385870765,\"seed\":-1064709934,\"status\":\"ACTIVE\",\"killed\":true,\"defaultTreatment\":\"off\",\"changeNumber\":1530047266117,\"algo\":2,\"conditions\":[{\"conditionType\":\"ROLLOUT\",\"matcherGroup\":{\"combiner\":\"AND\",\"matchers\":[{\"keySelector\":{\"trafficType\":\"organization\",\"attribute\":null},\"matcherType\":\"ALL_KEYS\",\"negate\":false,\"userDefinedSegmentMatcherData\":null,\"whitelistMatcherData\":null,\"unaryNumericMatcherData\":null,\"betweenMatcherData\":null,\"booleanMatcherData\":null,\"dependencyMatcherData\":null,\"stringMatcherData\":null}]},\"partitions\":[{\"treatment\":\"on\",\"size\":100},{\"treatment\":\"off\",\"size\":0}],\"label\":\"default rule\"}]}"
Now execute the redis command "set" to set the new content:
redis> set SPLITIO.split.featureFlagPerformanceMonitor "{\"trafficTypeName\":\"organization\",\"name\":\"featureFlagPerformanceMonitor\",\"trafficAllocation\":100,\"trafficAllocationSeed\":-1385870765,\"seed\":-1064709934,\"status\":\"ACTIVE\",\"killed\":true,\"defaultTreatment\":\"off\",\"changeNumber\":1530047266117,\"algo\":2,\"conditions\":[{\"conditionType\":\"ROLLOUT\",\"matcherGroup\":{\"combiner\":\"AND\",\"matchers\":[{\"keySelector\":{\"trafficType\":\"organization\",\"attribute\":null},\"matcherType\":\"ALL_KEYS\",\"negate\":false,\"userDefinedSegmentMatcherData\":null,\"whitelistMatcherData\":null,\"unaryNumericMatcherData\":null,\"betweenMatcherData\":null,\"booleanMatcherData\":null,\"dependencyMatcherData\":null,\"stringMatcherData\":null}]},\"partitions\":[{\"treatment\":\"on\",\"size\":100},{\"treatment\":\"off\",\"size\":0}],\"label\":\"default rule\"}]}"
After this change is made, your SDKs connected to Redis will automatically serve the default treatment for this feature flag.
If you introduce any change other than the one indicated above and the Split SDK cannot properly parse the feature flag definition, the CONTROL treatment will be returned.
Remember to make a change in the UI to let the Split Synchronizer revert this manual override. If you don't, your feature flag will remain killed, and the Split support team does not have a tool to detect it.
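If you prefer to script this change rather than pasting the full JSON into redis-cli, here is a minimal sketch using the go-redis client. The Redis address, database number, and flag name are assumptions for illustration, and the same disclaimer above applies.

```go
// Minimal sketch of the manual kill described above: parse the stored flag
// definition, set "killed" to true, and write it back. Field order in the
// re-encoded JSON may change, but it remains a valid definition.
package main

import (
	"context"
	"encoding/json"
	"log"
	"strings"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// Address, DB number, and flag name are assumptions.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379", DB: 1})
	key := "SPLITIO.split.featureFlagPerformanceMonitor"

	raw, err := rdb.Get(ctx, key).Result()
	if err != nil {
		log.Fatal(err)
	}

	// Decode, flip only the "killed" field, and re-encode to avoid
	// corrupting the rest of the definition.
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber() // keep numeric fields exactly as stored
	var def map[string]interface{}
	if err := dec.Decode(&def); err != nil {
		log.Fatal(err)
	}
	def["killed"] = true

	updated, err := json.Marshal(def)
	if err != nil {
		log.Fatal(err)
	}
	if err := rdb.Set(ctx, key, updated, 0).Err(); err != nil {
		log.Fatal(err)
	}
	log.Printf("flag %s killed; SDKs will now serve the default treatment", key)
}
```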
Upgrades
- Stop Split Sync gracefully
- Upgrade Split Sync binary
- Start the service again
- Watch the logs for a short period of time to make sure no warnings arise.