Introduce ReplicationCoordinator to support multiple AZs

flavorjones · flavorjones · commit 4c6654afe545 · 2025-06-07T16:34:26.000-04:00
It's common for applications that are deployed across multiple
availability zones (using a replicated database) to create an ad-hoc
method for indicating which zone is "active", meaning the zone
primarily responsible for writing to the database.

For example, a team may choose to use a MySQL system variable to
indicate the data center where the primary database sits. In that
case, they need to write code to make sure all Rails processes in all
zones query this efficiently (it may be slow to access in non-primary
zones) and are notified if the primary zone changes, as in the case of
a data center failover.

ReplicationCoordinator::Base is introduced to allow developers to
write code that determines whether a process is in an active zone, and
then:

- monitor and cache that value, with configurable polling interval
- fire callbacks when the state changes from active -> passive or vice versa
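
The two behaviors listed above can be illustrated with a minimal sketch. This is a simplified stand-in for the pattern, not the class added in this commit: `SketchCoordinator`, `ZoneLookupCoordinator`, the `#poll` method, and the zone names are all hypothetical, and polling is triggered by hand rather than by a Concurrent::TimerTask so that the example runs with only the Ruby standard library.

```ruby
# Simplified stand-in for the monitor/cache/callback pattern described above.
# NOT the class added in this commit: polling is driven by calling #poll by
# hand instead of a Concurrent::TimerTask.
class SketchCoordinator
  def initialize
    @hooks = { active: [], passive: [] }
    @last_active_zone = nil
    poll # prime the cache so #active_zone? is usable right away
  end

  # Subclasses decide what "active" means; this lookup may be slow.
  def fetch_active_zone
    raise NotImplementedError
  end

  # Cheap read of the cached value.
  def active_zone?
    @last_active_zone
  end

  # Register a hook, and run it immediately if the zone is already active.
  def on_active_zone(&block)
    @hooks[:active] << block
    block.call(self) if active_zone?
  end

  # Register a hook, and run it immediately if the zone is already passive.
  def on_passive_zone(&block)
    @hooks[:passive] << block
    block.call(self) unless active_zone?
  end

  # Refresh the cached value and fire hooks only when it changes.
  def poll
    new_value = fetch_active_zone
    return if new_value == @last_active_zone

    @last_active_zone = new_value
    @hooks[new_value ? :active : :passive].each { |hook| hook.call(self) }
  end
end

# Hypothetical subclass: the lambda stands in for a slow lookup such as
# reading a MySQL system variable that names the primary zone.
class ZoneLookupCoordinator < SketchCoordinator
  def initialize(local_zone:, lookup:)
    @local_zone = local_zone
    @lookup = lookup
    super()
  end

  def fetch_active_zone
    @lookup.call == @local_zone
  end
end

primary = "us-east"
coordinator = ZoneLookupCoordinator.new(local_zone: "us-east", lookup: -> { primary })

events = []
coordinator.on_active_zone { events << :active }   # runs now: zone is active
coordinator.on_passive_zone { events << :passive } # deferred until a change

primary = "us-west" # simulate a data center failover
coordinator.poll    # detects the change and fires the passive hooks
# events is now [:active, :passive]
```

In the real class the same refresh is expected to happen on a timer every +polling_interval+ seconds once #start_monitoring is called, so callers never invoke the slow lookup directly.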
diff --git a/activesupport/lib/active_support.rb b/activesupport/lib/active_support.rb
@@ -54,6 +54,7 @@ module ActiveSupport
   autoload :IsolatedExecutionState
   autoload :Notifications
   autoload :Reloader
+  autoload :ReplicationCoordinator
   autoload :SecureCompareRotator
 
   eager_autoload do
diff --git a/activesupport/lib/active_support/replication_coordinator.rb b/activesupport/lib/active_support/replication_coordinator.rb
@@ -0,0 +1,256 @@
+# frozen_string_literal: true
+
+module ActiveSupport
+  # = ActiveSupport::ReplicationCoordinator
+  #
+  # The \ReplicationCoordinator module supports applications that run in multiple availability zones.
+  #
+  # == Replication, Availability Zones, and Active-Passive State
+  #
+  # A common deployment topology for Rails applications is to have application servers running in
+  # multiple availability zones, with a single database that is replicated across these zones.
+  #
+  # In such a deployment, application code may need to determine whether it is running in an "active"
+  # zone and is responsible for writing to the database, or in a "passive" or "standby" zone that
+  # primarily reads from the zone-local database replica. And, in case of a zone failure, the
+  # application may need to be able to dynamically switch a passive zone to an active zone (or vice
+  # versa).
+  #
+  # The term "Passive" here is intended to include deployments in which the non-active zones are
+  # handling read requests, and potentially even performing occasional writes back to the active
+  # zone over an inter-AZ network link. The exact interpretation depends on the nature of the
+  # replication strategy and your deployment topology.
+  #
+  # Some example scenarios where knowing the replication state is important:
+  #
+  # - Custom database selector middleware
+  # - Controlling background jobs that should only run in an active zone
+  # - Deciding whether to preheat fragment caches for "next page" paginated results (which may not
+  #   be cached in time if relying on an inter-AZ network link and replication lag).
+  #
+  # The two classes in this module are:
+  #
+  # - ReplicationCoordinator::Base: An abstract base class that provides a monitoring
+  #   mechanism to fetch and cache the replication state on a configurable time interval and notify
+  #   when that state changes.
+  # - ReplicationCoordinator::SingleZone: A concrete implementation that always
+  #   indicates an active zone, and so it represents the default behavior for a single-zone
+  #   deployment that does not use database replication.
+  #
+  module ReplicationCoordinator
+    # = Replication Coordinator Abstract Base Class
+    #
+    # An abstract base class that provides a monitoring mechanism to fetch and cache the replication
+    # state on a configurable time interval and notify when that state changes.
+    #
+    # Subclasses need only implement #fetch_active_zone, which returns a boolean indicating whether
+    # the caller is in an active zone. This method may be expensive, so the class uses a
+    # Concurrent::TimerTask to periodically check (and cache) this value. The current cached status
+    # can cheaply be inspected with #active_zone?. The refresh interval can be set by passing a
+    # +polling_interval+ option to the constructor.
+    #
+    # The timer task must be explicitly started by calling #start_monitoring. Once started,
+    # registered callbacks are invoked when an active zone change is detected.
+    #
+    # == Basic usage
+    #
+    #   class CustomReplicationCoordinator < ActiveSupport::ReplicationCoordinator::Base
+    #     def fetch_active_zone
+    #       # Custom logic to determine if the local zone is active
+    #     end
+    #   end
+    #
+    #   coordinator = CustomReplicationCoordinator.new(polling_interval: 10.seconds)
+    #
+    #   coordinator.active_zone? # Immediately returns the cached value
+    #
+    #   coordinator.on_active_zone do |coordinator|
+    #     puts "This zone is now active"
+    #     # Start processes or threads that should only run in the active zone
+    #   end
+    #
+    #   coordinator.on_passive_zone do |coordinator|
+    #     puts "This zone is now passive"
+    #     # Stop processes or threads that should only run in the active zone
+    #   end
+    #
+    #   # Start a background thread to monitor the active zone status and invoke the callbacks on changes
+    #   coordinator.start_monitoring
+    #
+    #   coordinator.updated_at # Returns the last time the active zone status was checked
+    #
+    # Subclasses must implement #fetch_active_zone
+    class Base
+      attr_reader :state_change_hooks, :polling_interval, :executor, :logger, :updated_at
+
+      # Initialize a new coordinator instance.
+      #
+      # [+polling_interval+] How often to refresh active zone status (default: 5 seconds)
+      def initialize(polling_interval: 5, executor: ActiveSupport::Executor, logger: nil)
+        @state_change_hooks = { active: [], passive: [] }
+        @polling_interval = polling_interval
+        @executor = executor
+        @logger = logger || (defined?(Rails.logger) && Rails.logger)
+
+        @last_active_zone = nil
+        @updated_at = nil
+        @active_zone_watcher = nil
+
+        check_active_zone
+      end
+
+      # Determine if the local zone is active.
+      #
+      # This method must be implemented by subclasses to define the logic for determining if the
+      # local zone is active. The return value is used to trigger state change hooks when the active
+      # zone changes.
+      #
+      # It's assumed that this method may be slow, so ReplicationCoordinator has a background thread
+      # that calls this method every +polling_interval+ seconds and caches the result, which is
+      # returned by #active_zone?.
+      #
+      # Returns +true+ if the local zone is active, +false+ otherwise.
+      def fetch_active_zone
+        raise NotImplementedError
+      end
+
+      # Returns +true+ if the local zone is active, +false+ otherwise.
+      #
+      # This always returns a cached value.
+      def active_zone?
+        @last_active_zone
+      end
+
+      # Start monitoring for active zone changes.
+      #
+      # This starts a Concurrent::TimerTask to periodically refresh the active zone status. If a
+      # change is detected, then the appropriate state change callbacks will be invoked.
+      def start_monitoring
+        active_zone_watcher.execute
+      end
+
+      # Stop monitoring for active zone changes.
+      #
+      # This stops the Concurrent::TimerTask, if it is running.
+      def stop_monitoring
+        @active_zone_watcher&.shutdown
+      end
+
+      # Register a callback to be executed when the local zone becomes active.
+      #
+      # The callback will be immediately executed if this zone is currently active.
+      #
+      # [+block+] callback to execute when zone becomes active
+      #
+      # Yields the coordinator instance to the block.
+      def on_active_zone(&block)
+        state_change_hooks[:active] << block
+        block.call(self) if active_zone?
+      end
+
+      # Register a callback to be executed when the local zone becomes passive.
+      #
+      # The callback will be immediately executed if this zone is not currently active.
+      #
+      # [+block+] callback to execute when zone becomes passive
+      #
+      # Yields the coordinator instance to the block.
+      def on_passive_zone(&block)
+        state_change_hooks[:passive] << block
+        block.call(self) if !active_zone?
+      end
+
+      # Clear all registered state_change hooks.
+      def clear_hooks
+        state_change_hooks[:active] = []
+        state_change_hooks[:passive] = []
+      end
+
+      private
+        def active_zone_watcher
+          @active_zone_watcher ||= begin
+            task = Concurrent::TimerTask.new(execution_interval: polling_interval) do
+              check_active_zone
+            end
+
+            task.add_observer do |_, _, error|
+              if error
+                executor.error_reporter&.report(error, handled: false, source: "replication_coordinator.active_support")
+                logger&.error("#{error.detailed_message}: could not check #{self.class} active zone")
+              end
+            end
+
+            task
+          end
+        end
+
+        def check_active_zone
+          new_active_zone = executor_wrap { fetch_active_zone }
+          @updated_at = Time.now
+
+          if @last_active_zone.nil? || new_active_zone != @last_active_zone
+            @last_active_zone = new_active_zone
+
+            if @last_active_zone
+              logger&.info "#{self.class}: pid #{$$}: switching to active"
+              run_active_zone_hooks
+            else
+              logger&.info "#{self.class}: pid #{$$}: switching to passive"
+              run_passive_zone_hooks
+            end
+          end
+        end
+
+        def run_active_zone_hooks
+          run_hooks_for(:active)
+        end
+
+        def run_passive_zone_hooks
+          run_hooks_for(:passive)
+        end
+
+        def run_hooks_for(event)
+          state_change_hooks.fetch(event, []).each do |block|
+            block.call(self)
+          rescue Exception => exception
+            executor&.error_reporter&.report(exception, handled: true, source: "replication_coordinator.active_support")
+          end
+        end
+
+        def executor_wrap(&block)
+          if @executor
+            @executor.wrap(&block)
+          else
+            yield
+          end
+        end
+    end
+
+    # = "Single Zone" Replication Coordinator
+    #
+    # A concrete implementation that always indicates an active zone, and so it represents the
+    # default behavior for a single-zone deployment that does not use database replication.
+    #
+    # This is a simple implementation that always returns +true+ from #active_zone?
+    #
+    # == Basic usage
+    #
+    #   cluster = ActiveSupport::ReplicationCoordinator::SingleZone.new
+    #   cluster.active_zone? #=> true
+    #   cluster.on_active_zone { puts "Will always be called" }
+    #   cluster.on_passive_zone { puts "Will never be called" }
+    #   cluster.start_monitoring # Does nothing, since there is no monitoring needed.
+    class SingleZone < Base
+      # Always returns true, indicating this zone is active.
+      #
+      # Returns true.
+      def fetch_active_zone
+        true
+      end
+
+      def start_monitoring
+        # No-op implementation since no monitoring is needed.
+      end
+    end
+  end
+end
diff --git a/activesupport/lib/active_support/testing/replication_coordinator.rb b/activesupport/lib/active_support/testing/replication_coordinator.rb
@@ -0,0 +1,33 @@
+# frozen_string_literal: true
+
+module ActiveSupport
+  module Testing
+    # ReplicationCoordinator is a concrete implementation of
+    # ActiveSupport::ReplicationCoordinator::Base that can be used to test the behavior of objects
+    # that depend on replication state.
+    class ReplicationCoordinator < ActiveSupport::ReplicationCoordinator::Base
+      attr_reader :fetch_count
+
+      # Initializes the replication coordinator with an initial active zone state.
+      #
+      # The replication coordinator can be initialized with an initial active zone state using the
+      # optional +active_zone+ parameter, which defaults to +true+.
+      def initialize(active_zone = true, **options)
+        @next_active_zone = active_zone
+        @fetch_count = 0
+        super(**options)
+      end
+
+      # Sets the value that will next be returned by #fetch_active_zone, simulating an external
+      # replication state change.
+      def set_next_active_zone(active_zone)
+        @next_active_zone = active_zone
+      end
+
+      def fetch_active_zone # :nodoc:
+        @fetch_count += 1
+        @next_active_zone
+      end
+    end
+  end
+end
diff --git a/activesupport/test/replication_coordinator_test.rb b/activesupport/test/replication_coordinator_test.rb