# Active record poisoned transaction with postgres #881
```diff
@@ -1,4 +1,5 @@
 require 'set'
+require 'securerandom'
 require 'flipper'
 require 'active_record'

@@ -244,17 +245,26 @@ def delete(feature, gate)
 end

 def enable_multi(feature, gate, thing)
-  with_connection(@gate_class) do
-    @gate_class.create! do |g|
-      g.feature_key = feature.key
-      g.key = gate.key
-      g.value = thing.value.to_s
+  with_connection(@gate_class) do |connection|
+    begin
+      connection.transaction do
+        savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
+        connection.execute("SAVEPOINT #{savepoint_name}")
+
+        begin
+          @gate_class.create! do |g|
+            g.feature_key = feature.key
+            g.key = gate.key
+            g.value = thing.value.to_s
+          end
+        rescue ::ActiveRecord::RecordNotUnique
+          # already added so no need move on with life
+          connection.execute("ROLLBACK TO SAVEPOINT #{savepoint_name}")
+        end
+      end
     end
   end

   nil
-rescue ::ActiveRecord::RecordNotUnique
-  # already added so no need move on with life
 end

 def result_for_gates(feature, gates)
```

Inline review comments on the savepoint lines:

> You can use […]

> And that's also probably why you are "poisoning" the connection.

> Awesome! I'll try that out. Thanks for taking the time to teach!

> 1c9e0a2 did the trick. Still odd to me that mysql and sqlite don't raise about the poisoned transaction and the pg gem does. But perhaps it is something in the pg gem and not the AR adapter for Postgres?
Noting that Postgres transaction savepoints can cause performance degradation on postgres. GitLab have blogged about this and how they removed them from their app: https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/
@camallen yep, I'm aware. This is a very low throughput method. Even at a large company/app I'd be surprised if it was executed more than 100 times in a day. At that level of throughput I think this should be fine. Knowing that, are you still worried?

If anyone knows another way to fix it, I'm all ears, but after hacking around on it for 30 minutes this was the first thing that worked.
I am still a little worried about this change, perhaps unduly.

From my reading, the GitLab investigation showed that any savepoint use on the primary can impact all replicas, because each replica has to record all active transactions after the use of a savepoint, and this tracking is a problem during any 'long' running queries. On a very active system the duration of a 'long' query will vary, as it depends on the number of active transactions. If there are enough, they will saturate the pg server cache, spill from memory to disk, and then impact query performance due to slower IO on hot code paths.

My concern is the unpredictable nature of this issue, due to the interplay of the active transaction count and the duration of any queries that impact the replicas. It'll appear like random performance degradation that (hopefully) eventually self-heals after the queries complete.

Teams that use replicas most likely are already experiencing a high transaction load on the primary, and this change might then open them up to this situation.

One of the proposed solutions (which may have landed in a newer pg version) was to allow a configurable PG server cache size, to avoid spilling over to disk and slow IO on a hot code path. However, that isn't available by default on older pg versions and may not be configurable on pg cloud variants.

Again, I may be reading it wrong; I'm wary because I've been bitten in the past by long-running queries impacting replica query performance, and by the random nature of these issues.

AR's postgres adapter has a private `in_transaction?` method, backed by `open_transactions`, to detect if we're in a transaction. We might be able to introspect the connection adapter's `open_transactions` size to determine if we're in a transaction and whether we need to use a SAVEPOINT SQL directive, instead of opting everyone into the use of savepoints by default. I'm not privy to whether there is a better way to introspect the AR connection transaction stack state to determine if we can 'opt in' to the use of nested transactions. Maybe @byroot has thoughts.
I don't know the code, so it's hard to say. But it looks like it's trying to insert a record and ignore it if the record already exists? If so, then `ON DUPLICATE KEY IGNORE` wouldn't require a subtransaction.
Ref to a similar discussion when we used a savepoint in find_or_create: rails/rails#51052 (comment).
@camallen this isn't a hot code path and shouldn't ever be.

If it is, then you aren't using flipper correctly. Individual actors are for enabling a feature for a few people here and there per feature. Anything that requires 100+ should be in a group or % rollout, as detailed in the docs.

Any write in flipper is rare compared to reads, and reads are what flipper is optimized for.
Apologies for the confusion. When I used the 'hot code path' term, I was referring to the PostgreSQL server when a `savepoint` is used on a primary and the replica then has to start tracking all known transactions. I wasn't referring to flipper configuring a feature flag, though flipper could be the trigger when it issues a `savepoint` transaction.

I still think there is a risk with this change. If the GitLab investigation is correct, then this change introduces a ticking bomb that may at some point explode for busy PostgreSQL systems that have replicas enabled. Similar to the concerns raised in rails/rails#51052.

Would you be open to a PR that attempts the upsert 'on conflict do nothing' approach that byroot mentioned? I'd be up for working on this if it would be considered to land.

A naive implementation would run the model validations and, if the model is valid, use AR's `insert` / `insert_all` methods combined with `:unique_by` to skip the duplicate records and thus not poison the outer transaction.

A caveat being that the `insert on conflict` feature is only available on PG versions 9.5+, while active record supports PG 9.3+. It appears that flipper supports active record 4.2+, so any implementation would require branching based on pg server feature availability detection. If the server doesn't support `ON CONFLICT DO NOTHING` SQL, then the current `requires_new` savepoint would be used.
Looks like the configurable cache sizes have landed in PG 17: https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache
@camallen happy to take a look at a PR. Again, this is super low throughput so I don't believe it'll be an issue.