Active record poisoned transaction with postgres #881

Merged · 8 commits · Sep 3, 2024
Changes from 6 commits

24 changes: 17 additions & 7 deletions lib/flipper/adapters/active_record.rb
@@ -1,4 +1,5 @@
require 'set'
require 'securerandom'
require 'flipper'
require 'active_record'

@@ -244,17 +245,26 @@ def delete(feature, gate)
end

def enable_multi(feature, gate, thing)
  with_connection(@gate_class) do
    @gate_class.create! do |g|
      g.feature_key = feature.key
      g.key = gate.key
      g.value = thing.value.to_s
  with_connection(@gate_class) do |connection|
    begin
      connection.transaction do
        savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
        connection.execute("SAVEPOINT #{savepoint_name}")
Contributor
Noting that PostgreSQL transaction savepoints can cause performance degradation.

GitLab have blogged about this and how they removed them from their app: https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/

Collaborator Author

@camallen yep, I'm aware. This is a very low throughput method. Even at a large company/app I'd be surprised if it was executed more than 100 times in a day. I think at that level of throughput this should be fine. Knowing that, are you still worried?

If anyone knows another way to fix it, I'm all ears but after hacking around on it for 30 minutes this was the first thing that worked.

Contributor

I think at that level of throughput this should be fine. Knowing that, are you still worried?

I am still a little worried about this change, perhaps unduly.

From my reading, the GitLab investigation showed that any savepoint use on the primary can impact all replicas, because each replica has to record all active transactions after the use of a savepoint, and this tracking becomes a problem during any 'long'-running queries. On a very active system the duration of a 'long' query will vary, as it depends on the number of active transactions. If there are enough, they will saturate the PG server cache and spill from memory to disk, and then hurt query performance due to slower IO on hot code paths.

My concern is the unpredictable nature of this issue, due to the interplay of the active transaction count and the duration of any queries that impact the replicas. It'll appear as random performance degradation that (hopefully) eventually self-heals after the queries complete.

Teams that use replicas are most likely already experiencing a high transaction load on the primary, and this change might then open them up to this situation.

One of the proposed solutions (which may have landed in a newer PG version) was to allow a configurable PG server cache size to avoid spilling over to disk and slow IO on a hot code path. However, that isn't available by default on older PG versions and may not be configurable on PG cloud variants.

Again, I may be reading it wrong; I'm wary as I've been bitten in the past by long-running queries impacting replica query performance, and by the random nature of these issues.

If anyone knows another way to fix it, I'm all ears but after hacking around on it for 30 minutes this was the first thing that worked.

AR's PostgreSQL adapter has a private in_transaction? method, backed by open_transactions, to detect if we're in a transaction. We might be able to introspect the connection adapter's open_transactions size to determine whether we're in a transaction and need a SAVEPOINT SQL directive, instead of opting everyone into savepoints by default.

I'm not privy to a better way to introspect the AR connection's transaction stack state to determine if we can 'opt in' to nested transactions. Maybe @byroot has thoughts.
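A rough sketch of that introspection could look like the following. The helper name and control flow are illustrative, not flipper's actual code; it assumes Active Record is loaded so that ActiveRecord::RecordNotUnique resolves when an insert actually collides.

```ruby
require "securerandom"

# Hypothetical helper: only issue SAVEPOINT statements when the caller is
# already inside a transaction (open_transactions > 0); standalone calls
# skip the subtransaction entirely.
def with_optional_savepoint(connection)
  if connection.open_transactions > 0
    name = "flipper_savepoint_#{SecureRandom.hex(8)}"
    connection.execute("SAVEPOINT #{name}")
    begin
      yield
      connection.execute("RELEASE SAVEPOINT #{name}")
    rescue ActiveRecord::RecordNotUnique
      # Roll back to the savepoint so the outer transaction stays usable.
      connection.execute("ROLLBACK TO SAVEPOINT #{name}")
    end
  else
    begin
      yield
    rescue ActiveRecord::RecordNotUnique
      # No surrounding transaction, so a duplicate insert poisons nothing.
    end
  end
end
```

This would keep the savepoint cost off the common path, at the price of two code paths to test.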

Contributor

Maybe @byroot has thoughts.

I don't know the code, so it's hard to say. But it looks like it's trying to insert a record and ignore it if the record already exists? If so, then ON DUPLICATE KEY IGNORE wouldn't require a subtransaction.

Contributor

Ref to similar discussion when we used a savepoint in find_or_create rails/rails#51052 (comment).

Collaborator Author

@camallen this isn't a hot code path and shouldn't ever be.

If it is, then you aren't using flipper correctly. Individual actors are for enabling a feature for a few people here and there per feature. Anything that requires 100+ should be in a group or % rollout, as detailed in the docs.

Any write in flipper is a rare thing compared to reads and that is what flipper is optimized for.

Contributor
@camallen · Sep 4, 2024

this isn't a hot code path and shouldn't ever be.

Apologies for the confusion. When I used the 'hot code path' term, I was referring to the PostgreSQL server when a savepoint is used on a primary and the replica then has to start tracking all known transactions. I wasn't referring to flipper configuring a feature flag, though flipper could be the trigger when it issues a savepoint transaction.

I still think there is a risk with this change. If the GitLab investigation is correct, then this change introduces a ticking time bomb that may at some point explode for busy PostgreSQL systems that have replicas enabled. Similar to the concerns raised in rails/rails#51052.

Would you be open to a PR that attempts the upsert-on-conflict-do-nothing approach that byroot mentioned? I'd be up for working on this if it would be considered for landing.

A naive implementation would run the model validations and, if the model is valid, use AR's insert / insert_all methods combined with :unique_by to skip duplicate records and thus not poison the outer transaction.

A caveat: the insert-on-conflict feature is only available on PG 9.5+, while Active Record supports PG 9.3+. It appears that flipper supports Active Record 4.2+, so any implementation would require branching based on detection of PG server feature availability.

If the server doesn't support ON CONFLICT DO NOTHING SQL, then the current requires_new savepoint would be used.

Contributor

Looks like the configurable cache sizes have landed in PG 17 https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache

Collaborator Author

@camallen happy to take a look at a PR. Again, this is super low throughput so I don't believe it'll be an issue.

Contributor

And that's also probably why you are "poisoning" the connection. conn.transaction { conn.transaction { } } doesn't do a nested transaction. You need requires_new: true for that.

Collaborator Author

Awesome! I'll try that out. Thanks for taking the time to teach!

Collaborator Author

1c9e0a2 did the trick.

Still odd to me that mysql and sqlite don't raise about the poisoned transaction but the pg gem does. But perhaps it is something in the pg gem and not the AR adapter for Postgres?

requires_new: true is definitely cleaner than the savepoint junk I had, so I'll go with that for now.

        begin
          @gate_class.create! do |g|
            g.feature_key = feature.key
            g.key = gate.key
            g.value = thing.value.to_s
          end
        rescue ::ActiveRecord::RecordNotUnique
          # already added, so no need; move on with life
          connection.execute("ROLLBACK TO SAVEPOINT #{savepoint_name}")
        end
      end
    end
  end

  nil
rescue ::ActiveRecord::RecordNotUnique
  # already added, so no need; move on with life
end

def result_for_gates(feature, gates)

12 changes: 12 additions & 0 deletions spec/flipper/adapters/active_record_spec.rb
@@ -75,6 +75,18 @@
flipper.preload([:foo])
end

it 'should not poison wrapping transactions' do
  flipper = Flipper.new(subject)

  actor = Struct.new(:flipper_id).new('flipper-id-123')
  flipper.enable_actor(:foo, actor)

  ActiveRecord::Base.transaction do
    flipper.enable_actor(:foo, actor)
    expect(Flipper::Adapters::ActiveRecord::Gate.count).to eq 1
  end
end

context "ActiveRecord connection_pool" do
before do
ActiveRecord::Base.connection_handler.clear_active_connections!

4 changes: 2 additions & 2 deletions spec/flipper/adapters/strict_spec.rb
@@ -29,13 +29,13 @@

context "#get" do
  it "raises an error for unknown feature" do
    expect(silence { subject.get(feature) }).to match(/Could not find feature "unknown"/)
    expect(capture_output { subject.get(feature) }).to match(/Could not find feature "unknown"/)
  end
end

context "#get_multi" do
  it "raises an error for unknown feature" do
    expect(silence { subject.get_multi([feature]) }).to match(/Could not find feature "unknown"/)
    expect(capture_output { subject.get_multi([feature]) }).to match(/Could not find feature "unknown"/)
  end
end
end

14 changes: 7 additions & 7 deletions spec/flipper/engine_spec.rb
@@ -29,7 +29,7 @@

let(:config) { application.config.flipper }

subject { application.initialize! }
subject { SpecHelpers.silence { application.initialize! } }

shared_examples 'config.strict' do
let(:adapter) { Flipper.adapter.adapter }
@@ -233,7 +233,7 @@
it "initializes cloud configuration" do
stub_request(:get, /flippercloud\.io/).to_return(status: 200, body: "{}")

application.initialize!
silence { application.initialize! }

expect(Flipper.instance).to be_a(Flipper::Cloud::DSL)
expect(Flipper.instance.instrumenter).to be_a(Flipper::Cloud::Telemetry::Instrumenter)
Expand Down Expand Up @@ -263,7 +263,7 @@
}

it "configures webhook app" do
application.initialize!
silence { application.initialize! }

stub = stub_request(:get, "https://www.flippercloud.io/adapter/features?exclude_gate_names=true").with({
headers: { "flipper-cloud-token" => ENV["FLIPPER_CLOUD_TOKEN"] },
@@ -278,7 +278,7 @@

context "without CLOUD_SYNC_SECRET" do
it "does not configure webhook app" do
application.initialize!
silence { application.initialize! }

post "/_flipper"
expect(last_response.status).to eq(404)
@@ -288,7 +288,7 @@
context "without FLIPPER_CLOUD_TOKEN" do
it "gracefully skips configuring webhook app" do
ENV["FLIPPER_CLOUD_TOKEN"] = nil
application.initialize!
silence { application.initialize! }
expect(Flipper.instance).to be_a(Flipper::DSL)

post "/_flipper"
@@ -324,7 +324,7 @@
end

it "enables cloud" do
application.initialize!
silence { application.initialize! }
expect(ENV["FLIPPER_CLOUD_TOKEN"]).to eq("credentials-token")
expect(ENV["FLIPPER_CLOUD_SYNC_SECRET"]).to eq("credentials-secret")
expect(Flipper.instance).to be_a(Flipper::Cloud::DSL)
@@ -339,7 +339,7 @@

describe "config.actor_limit" do
let(:adapter) do
application.initialize!
silence { application.initialize! }
Flipper.adapter.adapter.adapter
end


2 changes: 1 addition & 1 deletion spec/flipper/middleware/memoizer_spec.rb
@@ -285,7 +285,7 @@
end

def get(uri, params = {}, env = {}, &block)
silence { super(uri, params, env, &block) }
capture_output { super(uri, params, env, &block) }
end

include_examples 'flipper middleware'

2 changes: 1 addition & 1 deletion spec/support/fail_on_output.rb
@@ -1,7 +1,7 @@
if ENV["CI"] || ENV["FAIL_ON_OUTPUT"]
  RSpec.configure do |config|
    config.around do |example|
      output = silence { example.run }
      output = capture_output { example.run }
      fail "Use `silence { }` to avoid printing to STDOUT/STDERR\n#{output}" unless output.empty?
    end
  end
15 changes: 13 additions & 2 deletions spec/support/spec_helpers.rb
@@ -79,11 +79,22 @@ def silence
  original_stdout = $stdout

  # Redirect stderr and stdout
  output = $stderr = $stdout = StringIO.new
  $stderr = $stdout = StringIO.new

  yield
ensure
  $stderr = original_stderr
  $stdout = original_stdout
end

def capture_output
  original_stderr = $stderr
  original_stdout = $stdout

  output = $stdout = $stderr = StringIO.new

  yield

  # Return output
  output.string
ensure
  $stderr = original_stderr