This article covers best practices for running Split Synchronizer v5.0.0. By default, Split’s SDKs keep segment and feature flag data synchronized as users navigate across disparate systems, treatments, and conditions. However, some languages don’t have a native capability to keep a shared local cache of this data to properly serve treatments.
The Split Synchronizer coordinates the sending and receiving of data to a remote datastore (Redis) that all of your processes can share to pull data for the evaluation of treatments, acting as the cache for your SDKs. It also posts impression data and metrics generated by the SDKs back to Split’s servers, for exposure in the Split user interface or sending to the data integration of your choice.
For more information on configuring the Synchronizer, refer to the Split synchronizer guide.
The synchronizer runs as a standalone process on dedicated or shared servers, so it doesn’t affect the performance of your code or Split’s SDKs. The following are best practices for running the Synchronizer.
SDKs
Alerting on the CONTROL treatment. Although CONTROL is a known treatment, an application that suddenly starts reporting the CONTROL treatment is a sign that something is wrong when evaluating splits. We recommend setting up an impressions listener that emits StatsD metrics for the CONTROL treatment. For information about the impressions listener for Split Sync, refer to the Split Synchronizer guide.
Logs. We recommend reporting and alerting on errors coming from Split logs. Anything labeled as error or exception from the Split logs should be of concern. One way to isolate Split logs is to direct them to a custom location following the logging instructions for each SDK.
Split Sync
Split Sync is written in Go and is highly performant compared with JVM-based and interpreted languages. The following topics are relevant when running Split Sync for a production workload.
Setup
We recommend running Split Sync under supervision, to make sure the process can be brought back up in the event of a crash.
Redis: Set up Split Sync with its own Redis database. We do not recommend using database zero (the default), as it is usually used by other applications.
Resiliency with Redis: Because Redis and the Split Synchronizer add additional services to your infrastructure, be sure to configure your applications for situations where Redis may be unavailable. Each SDK has its own Redis timeout configuration, which should be set to a value appropriate for your infrastructure. If the SDK times out, it returns the control treatment, so code that uses the SDK must handle the control treatment appropriately.
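For reference, below is a minimal sketch of pointing the Synchronizer at a dedicated, non-zero Redis database. It assumes the Synchronizer's JSON config exposes the Redis connection through a redis section with host, port, and db keys; exact key names and nesting can vary between Synchronizer versions, so confirm them against the Split Synchronizer guide. The API key and host values are placeholders.

{
  "apikey": "YOUR_SDK_API_KEY",
  "redis": {
    "host": "your-redis-host.internal",
    "port": 6379,
    "db": 1
  }
}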
Hardware requirements
Split has done extensive testing, but it is important to understand that each environment is different. The minimum requirements we tested using Amazon AWS cloud for a production workload were:
- Split Sync process: AWS EC2 m5.large/m5a.large, 2 vCPUs, 8 GB RAM
- Redis: AWS ElastiCache cache.m5.large, 2 vCPUs, 6 GB RAM
Benchmark
We ran tests at a load of about 100k impressions per synchronizer + Redis node pair, on an 8-node Redis cluster.
We tested the setup under a few scenarios, using the machines outlined below, and all metrics were healthy:
- Redis machines: cache.m5.4xlarge from AWS
  - 16 cores and 52 GB RAM, up to 10 Gigabit network
  - We used 8 masters, each with one replica.
- Split Synchronizer machines: running inside a Kubernetes cluster, on top of a c5a.24xlarge instance type (from AWS, 96 cores and 192 GB RAM)
  - Each synchronizer instance was assigned:
    - 8 cores
    - 32 GB RAM
We validated the following scenarios and observed no performance problems:
- Slow ramp-up of traffic from 200k to 800k impressions per second
- Full ramp to 800k impressions per second
- Simulated a 2X traffic spike to 1.6M impressions per second
- Stopped a Synchronizer for 5 minutes, then restarted it
Alerts
We recommend the following alerts:
- Split Sync process: Keep CPU under 50% utilization to avoid performance degradation and to prevent the Split Sync process from falling behind.
- Redis: Keep CPU under 50% utilization. Memory should remain under safe limits; 60 or 70% is generally fine, but monitor the rate of growth. Running constantly at 70% utilization can be acceptable, but a growth rate of 10% every 5 minutes is likely a problem and a sign that Split Sync is not able to keep up with evicting data. If Redis memory continues to increase, try the following procedure (a config sketch follows this list):
  - Stop Split Sync gracefully to avoid losing data.
  - Increase by two (2) the number of threads dedicated to posting impressions (config key: impressionsThreads).
  - Start Split Sync again.
  - Repeat if memory consumption continues to increase.
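As a reference for the procedure above, the fragment below sketches raising the impression-posting threads from the default of 1 to 3 (an increase of two). Only the impressionsThreads key comes from this runbook; where it nests inside the full configuration file depends on the Synchronizer version, so check the Split Synchronizer guide for the exact location.

{
  "impressionsThreads": 3
}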
Alerting on the CONTROL treatment can also be set up at the Split Synchronizer level by configuring the impression listener described in the Split Synchronizer guide. This is the same approach described for the SDKs at the top of this runbook, but from the Synchronizer's standpoint.
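A sketch of enabling the Synchronizer's impression listener is shown below, assuming it is configured through an impressionListener section with an endpoint key (verify the exact key path for your Synchronizer version in the Split Synchronizer guide). The URL is a placeholder for a small service you run that receives the posted impressions and emits StatsD metrics whenever the treatment is CONTROL.

{
  "impressionListener": {
    "endpoint": "http://localhost:8888/impressions"
  }
}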
Health Check Monitors
We have two monitors that periodically validate the Synchronizer's health.
One monitor is in charge of the health of the application: it verifies that the Synchronizer's synchronization tasks are running correctly and that it has access to the storage.
- In order to consume this information, issue a GET request to /health/application. This endpoint has two possible responses:
- 200 OK
{
  "healthy": true,
  "healthySince": "2021-10-29T15:59:21.231209-03:00",
  "items": [
    {
      "name": "Splits",
      "healthy": true,
      "lastHit": "2021-10-29T16:01:52.04807-03:00"
    },
    {
      "name": "Segments",
      "healthy": true,
      "lastHit": "2021-10-29T16:01:52.106651-03:00"
    },
    {
      "name": "Storage",
      "healthy": true,
      "lastHit": "2021-10-29T16:19:21.446657-03:00"
    }
  ]
}
- 500 Internal Server Error
{
  "healthy": false,
  "items": [
    {
      "name": "Splits",
      "healthy": false,
      "lastHit": "2021-10-29T16:01:52.04807-03:00"
    },
    {
      "name": "Segments",
      "healthy": true,
      "lastHit": "2021-10-29T16:01:52.106651-03:00"
    },
    {
      "name": "Storage",
      "healthy": true,
      "lastHit": "2021-10-29T16:19:21.446657-03:00"
    }
  ]
}
If the monitor detects that the Synchronizer hasn't synced within a threshold of time, the check fails and returns 500. The Synchronizer calculates the threshold from the refresh rate, or from the token expiration if it is running in streaming mode.
The second monitor is in charge of the health of the dependencies; it verifies the health of the external services that the Synchronizer consumes.
- In order to consume this information, issue a GET request to /health/dependencies. This endpoint always returns 200 along with the state for each dependency:
{
  "serviceStatus": "healthy",
  "dependencies": [
    {
      "service": "https://telemetry.split.io/health",
      "healthy": true,
      "healthySince": "2021-10-29T16:34:58.272479-03:00"
    },
    {
      "service": "https://auth.split.io/health",
      "healthy": true,
      "healthySince": "2021-10-29T16:34:58.272484-03:00"
    },
    {
      "service": "https://sdk.split.io/api/health",
      "healthy": true,
      "healthySince": "2021-10-29T16:34:58.272486-03:00"
    },
    {
      "service": "https://events.split.io/api/health",
      "healthy": true,
      "healthySince": "2021-10-29T16:34:58.272487-03:00"
    },
    {
      "service": "https://streaming.split.io/health",
      "healthy": true,
      "healthySince": "2021-10-29T16:34:58.272488-03:00"
    }
  ]
}
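These endpoints can also be wired into your orchestrator's health checks. Below is a sketch of an ECS container healthCheck that polls /health/application, assuming the Synchronizer's admin API listens on port 3010 and that curl is available inside the container; adjust the port to match your admin configuration.

{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:3010/health/application || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3
  }
}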
Alerting on logs
To augment your alerting with log-based alerts, consider the following:
During Redis errors, Split Sync logs lines containing:
connect: connection refused
For any other I/O errors, you should see:
Error fetching segment
Error fetching splits
or, for a more generic error:
Error fetching
When you manually execute operations such as dropping or flushing impressions or events, an error is returned if another operation is running at the same time.
In Debug level, the following log appears for flushing:
Cannot execute flush. Another operation is performing operations in Events.
and for impressions
Cannot execute flush. Another operation is performing operations in Impressions.
In Debug level, the following log appears for dropping:
Cannot execute drop. Another operation is performing operations in Events.
and for impressions
Cannot execute drop. Another operation is performing operations in Impressions.
Additionally, the Synchronizer performs automatic eviction for events and impressions. Manual and automatic eviction are not executed at the same time; if an eviction is already running, the process skips the new operation. In Debug level, it logs the following message:
Another task is performing operations on Events. Skipping.
and for impressions
Another task is performing operations on Impressions. Skipping.
Webhook
If you want to track messages using Slack, add the webhook URL and the Slack channel to the Split Synchronizer configuration.
How you start your Synchronizer determines how you add this parameter:
| JSON | CLI PARAMETER | DOCKER ENV | TYPE | DESCRIPTION |
|---|---|---|---|---|
| channel | slack-channel | SPLIT_SYNC_SLACK_CHANNEL | string | Set the Slack channel or user to receive a real-time summary of ERROR-level log messages. |
| webhook | slack-webhook | SPLIT_SYNC_SLACK_WEBHOOK | string | Set the Slack webhook URL to receive a real-time summary of ERROR-level log messages. |
With this webhook, you can track error-level messages, as well as when the Split Synchronizer starts, when it is gracefully shut down, and when it is forced to stop.
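For example, when configuring through the JSON file, the keys from the table map to a fragment like the one below. The webhook URL and channel values are placeholders, and the section of the configuration file where these keys nest depends on the Synchronizer version; check the Split Synchronizer guide for the exact location.

{
  "webhook": "https://hooks.slack.com/services/XXXX/XXXX/XXXX",
  "channel": "#split-sync-alerts"
}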
Checking to see if Split Synchronizer is configured properly
The following config keys let you adjust the settings if your workload demands a custom configuration. The defaults keep memory consumption constant, even at high load; however, not all workloads are alike. This section provides instructions for measuring the capacity of the Synchronizer process and fine-tuning it if the default settings are not sufficient. With version 5 of the Synchronizer this check is still valid, but the result should stay close to 0 at all times because this version constantly evicts impressions and events.
- Fp: frequency per post, determined by ImpressionsPostRate and EventsPostRate
- Ap: amount of impressions or events per post, determined by ImpressionsPerPost and EventsPerPost
- T#: number of threads used for sending data
- T(h) = (3,600 / Fp) * Ap * T#: the total amount of impressions or events flushed per hour
- Es: events generated per second
- Eh = Es * 3,600: events generated per hour
- Is: impressions generated per second
- Ih = Is * 3,600: impressions generated per hour
- Xh: depending on what you are calculating, use either Ih or Eh to analyze impressions or events, respectively
Using the definitions above, calculate lambda (λ) and analyze its value according to the descriptions below:
λ = T(h) / Xh
If λ = 1: The current configuration is processing events or impressions without keeping elements in the queue. In other words, the eviction rate equals the generation rate, and the Split Synchronizer is able to flush data as it arrives from the SDKs.
If λ < 1: The current configuration may not be enough to process all the incoming data and, over time, may produce an ever-increasing memory footprint. The recommendation is to increase the number of threads or decrease the post rate (Fp, the number of seconds between flushes). Increase the number of threads first if they are still at the default value of 1, and do not exceed the number of cores. When decreasing the post rate (flush interval), lower the value conservatively, by ten or twenty percent at a time.
Example 1
We calculate the performance of impressions considering the following configuration scenario:
- Impressions Post Rate (Fp) = 60 seconds
- Impressions Per Post (Ap) = 1,000
- Impressions generated per second (Is) = 3
- Impressions generated in one hour (Ih) = 3 * 3,600 = 10,800
- Number of threads (T#) = 1
Let's do some math. The total amount of impressions sent per hour is:
T(h) = (3,600 / Fp) * Ap * T# = (3,600 / 60) * 1,000 * 1 = 60,000
and Xh = Ih = 10,800 (impressions generated per hour).
Then our λ factor is:
λ = T(h) / Xh = 60,000 / 10,800 = 5.55
- λ is higher than 1: The configuration above supports some peaks and can send all the impressions.
Example 2
Now let’s consider a higher number of Impressions per hour with the same configuration as before:
- Impressions Post Rate (Fp) = 60 seconds
- Impressions Per Post (Ap) = 1,000
- Impressions generated per second (Is) = 30
- Impressions generated in one hour (Ih) = 30 * 3,600 = 108,000
- Number of threads (T#) = 1
Let's do more math:
- T(h) = (3,600 / 60) * 1,000 * 1 = 60,000
- Xh = Ih = 108,000
- λ = T(h) / Xh = 60,000 / 108,000 = 0.55
- λ is less than 1: The configuration above is not enough to flush all the impressions. It needs more than one hour to evict and send all the elements to the Split servers, while the Synchronizer keeps receiving new elements during that time. In this case, the corrective action is to increase the number of threads if the default of one is in use, and then to decrease the post rate as indicated in the previous section.
- T# = 2 or Fp = 30 is good enough.
Note: Eviction can also be executed manually, but keep in mind that each manual run sends 5 batches of 5,000 elements (25,000 in total per call). In this case, 2 calls to that manual eviction are needed to evict the 48,000 pending impressions.
Example 3
Let's try with a higher number of Impressions generated per hour:
- Impressions Post Rate (Fp) = 60 seconds
- Impressions Per Post (Ap) = 1,000
- Impressions generated per second (Is) = 300
- Impressions generated in one hour (Ih) = 300 * 3,600 = 1,080,000
- Number of threads (T#) = 2
Let's do more math:
- T(h) = (3,600 / 60) * 1,000 * 2 = 120,000
- Xh = Ih = 1,080,000
- λ = T(h) / Xh = 120,000 / 1,080,000 = 0.11
- λ is less than 1: This indicates that the process is not adequately provisioned to send all generated impressions. Even if this is only a peak, it takes more than 8 hours to send all of the impressions to the Split servers (only 120,000 impressions are sent per hour), even when using more than one thread. Manual eviction is not the solution either. In this case, editing the configuration is the better approach.
T# = 5 and Fp = 15 is an example of making sure that the generation rate is less than the eviction rate.
Deploying Synchronizer to AWS ECS securely
To create a task definition that doesn't reveal API keys and passwords in the environment, create parameters in AWS Systems Manager Parameter Store, then reference their ARNs when adding the ECS environment variables:
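A sketch of a task definition fragment using this approach is shown below. The secrets/valueFrom mechanism is standard ECS; the environment variable names, parameter ARNs, and image tag are illustrative placeholders, so map them to the Docker environment variables documented for your Synchronizer version.

{
  "containerDefinitions": [
    {
      "name": "split-synchronizer",
      "image": "splitsoftware/split-synchronizer:5.0.0",
      "secrets": [
        {
          "name": "SPLIT_SYNC_APIKEY",
          "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/split-sync/sdk-apikey"
        },
        {
          "name": "SPLIT_SYNC_REDIS_PASS",
          "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/split-sync/redis-password"
        }
      ]
    }
  ]
}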
Upgrading the Synchronizer
To upgrade the Synchronizer, do the following:
- Stop Split Sync gracefully.
- Upgrade the Split Sync binary.
- Start the service again.
- Watch the logs for a short period of time to make sure no warnings arise.