-
Notifications
You must be signed in to change notification settings - Fork 3.9k
4.0: Use persistent_term
for the feature flag registry
#10988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
very creative! |
1a0cc72
to
e7de95c
Compare
b6a7a14
to
f2b8c5d
Compare
I took some microbenchmarks of the two approaches - generating a module (albeit an Elixir module in the test) and reading from a (which can be run with Benchee results...
|
I also ran this branch against the performance tests vs. main: https://grafana.lionhead.rabbitmq.com/d/1OSIKB4Vk/rabbitmq-performance-tests?orgId=1&from=1713189778195&to=1713195192973&shareView=link At a glance the two branches are hard to tell apart. The persistent_term may be slower but we might not read feature flags so often that it's noticeable. @mkuratczyk could I bug you to take a look and weigh in on these results when you get a chance? |
Yeah, I can't think of code paths where this could slow down normal operations. Small differences between environments are unfortunately always there in these tests (there are multiple reasons for this, including intentional randomization in a few places, both in perf-test as well as on the broker side). I'm fairly sure the observed differences are due to such entropy, not this PR |
f2b8c5d
to
b95c129
Compare
receive {'DOWN', MRef, process, Spammer, _} -> ok end. | ||
receive {'DOWN', MRef, process, Spammer, _} -> ok end, | ||
%% All feature flags appeared. | ||
?assertEqual(FinalFFList, ?list_ff(all)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion fails on main but it's because of a bug in the test helpers (see 2e4e321) rather than a bug in the feature flag system
On an M1 machine with 10 cores and a fast SSD, this shaves some 9% off of node-reported startup time 👏 on top of the improvements in #10989. This should nicely benefit all future CI runs where clusters are formed and nodes are restarted many times in total. |
77d0c90
to
53a1d5d
Compare
persistent_term
for the feature flag registrypersistent_term
for the feature flag registry
Previously the feature flag registry was a module which was regenerated (i.e. with `erl_syntax` and `compile:forms/2`) whenever the set of known feature flags or feature flag states changed. Compiling and loading a module is fairly expensive though, costing around 35-40ms for me locally, and the registry module is regenerated many times on boot. So this change saves a fair amount of time on boot. A regular single-node broker boots locally for me in around 4250ms via `bazel run broker`. With this patch the boot time falls to around 3300ms. This patch also improves the time it takes to enable individual feature flags. `time rabbitmqctl enable_feature_flag khepri_db` for example changes locally for me from around 1.2s to 1.0s with this change.
This prevents a potential race which was possible when two processes attempted to initialize the feature flag registry at the same time. One process attempting to initialize the registry could take a relatively long time to read the available feature flags meanwhile another process could read a more recently changed set of available feature flags. Locking before reading and writing the flag prevents one update from clobbering the other. This race was not possible before the registry used a persistent_term because we read the version of the registry module as the very first step when initializing. When we created a new module we checked the version again and retried the update if it changed in the meantime. That retry mechanism has been removed since we no longer track the current version of the registry. Instead we can use a local lock for the same effect.
`rabbit_feature_flags:inject_test_feature_flags/2` could discard flags from the persistent term because the read and write to the persistent term were not atomic. `feature_flags_SUITE:registry_concurrent_reloads/1` spawns many processes to modify the feature flag registry concurrently but also updates this persistent term concurrently. The processes would race so that many would read the initial flags, add their flag and write that state, discarding any flags that had been written in the meantime. We can add a lock around changes to the persistent term to make the changes atomic. Non-atomic updates to the persistent term caused unexpected behavior in the `registry_concurrent_reloads/1` case previously. The `PT_TESTSUITE_ATTRS` persistent_term only ended up containing a few of the desired feature flags (for example only `ff_02` and `ff_06` along with the base `ff_a` and `ff_b`). The case did not fail because the registry continued to make progress towards that set of feature flags. However the test case never reached the "all feature flags appeared" state of the spammer and so the new assertion added at the end of the case in this commit would fail.
These test cases and RPC calls were concerned with deadlocks possible from modifying the code server process from multiple callers. With the switch to persistent_term in the parent commits these kinds of deadlocks are no longer possible. We should keep the RPC calls to `rabbit_ff_registry_wrapper:inventory/0` though for mixed-version compatibility with nodes that use module generation instead of `persistent_term` for their registry.
53a1d5d
to
79e1350
Compare
Force-push was just a rebase |
[Why] Up-to RabbitMQ 3.13.x, there was a case where if: 1. you enabled a plugin 2. you enabled its feature flags 3. you disabled the plugin 4. you restarted a node (or upgraded it) ... the node could crash on startup because it had a feature flag marked as enabled that it didn't know about: error:{badmatch,#{feature_flags => ... rabbit_ff_controller:-check_one_way_compatibility/2-fun-0-/3, line 514 lists:all_1/2, line 1520 rabbit_ff_controller:are_compatible/2, line 496 rabbit_ff_controller:check_node_compatibility_task1/4, line 437 rabbit_db_cluster:check_compatibility/1, line 376 This was "fixed" by the new way of keeping the registry in memory (#10988) because it introduces a slight change of behavior. Indeed, the old way walked through the `FeatureFlags` map and looked up the state in the `FeatureStates` map to create the `is_enabled/1` function. The new way just looks up the state in `FeatureStates`. [How] The new testcase succeeds on 4.0.x and `main`, but would fail on 3.13.x with the aforementionne crash.
[Why] Up-to RabbitMQ 3.13.x, there was a case where if: 1. you enabled a plugin 2. you enabled its feature flags 3. you disabled the plugin 4. you restarted a node (or upgraded it) ... the node could crash on startup because it had a feature flag marked as enabled that it didn't know about: error:{badmatch,#{feature_flags => ... rabbit_ff_controller:-check_one_way_compatibility/2-fun-0-/3, line 514 lists:all_1/2, line 1520 rabbit_ff_controller:are_compatible/2, line 496 rabbit_ff_controller:check_node_compatibility_task1/4, line 437 rabbit_db_cluster:check_compatibility/1, line 376 This was "fixed" by the new way of keeping the registry in memory (#10988) because it introduces a slight change of behavior. Indeed, the old way walked through the `FeatureFlags` map and looked up the state in the `FeatureStates` map to create the `is_enabled/1` function. The new way just looks up the state in `FeatureStates`. [How] The new testcase succeeds on 4.0.x and `main`, but would fail on 3.13.x with the aforementionne crash.
This is an experiment for using
persistent_term
to hold the feature flag registry.rabbit_ff_registry
is currently implemented with a module that is dynamically regenerated witherl_syntax
,compile:forms/2
and code reloading. The module approach has the advantage of maximizing read speed but it has a few drawbacks:rabbit_ff_registry
functions.bazel run broker
) and a total savings of ~200ms withrabbitmqctl enable_feature_flag khepri_db
for example.The question is whether
persistent_term
will be fast enough. Running the usualone_fast
perf-test flag setup I didn't see a difference between this branch and main but we'll want to run this through the full performance test suite before considering it.This branch also includes a concurrency fix for
rabbit_feature_flags:inject_test_feature_flags/2
which can cause thefeature_flags_SUITE:registry_concurrent_reloads/1
case to misbehave.