feat(resilience): implement proactive rpc failover and process-safe state locking #554
Open
Vicky08100 wants to merge 1 commit into
@Vicky08100 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
Closes #453
Closes #510
PR Description
This pull request implements resilience and concurrency safety improvements to the StellarFlow Backend.
Proactive RPC Failover Management (Issue #510)
Previously, waiting for an active Soroban RPC node connection to time out during critical ledger submission windows caused transaction pipelines to stall. To address this:
- A supervisor (`RPCNodeFailoverSupervisor`) in `src/network/nonce_tracker.py` that runs a background daemon thread checking RPC node health periodically.
- The `getHealth` JSON-RPC method is used to measure RTT latency.
- `FailoverRouter` in `src/network/rpc_client.py` was refactored to obtain the best active endpoint from the supervisor rather than failing sequentially and waiting for the 3.5s timeout.

Guarding Shared Subprocess State (Issue #453)
When the ingestion engine spawns secondary worker processes to evaluate cryptographic paths, simultaneous writes to local shared state maps risk throwing file collision errors. To solve this:
- `StateRegister` in `src/utils/state.py` now uses a `multiprocessing.Lock` along with Unix advisory file-locking via the Python standard `fcntl` module on Linux.
- State is persisted atomically via `os.replace` to prevent file collision errors.

Changes Made
- `src/network/nonce_tracker.py`: implements `RPCNodeFailoverSupervisor` and exports the default `rpc_supervisor` instance.
- `src/network/rpc_client.py`: integrates `FailoverRouter` with `RPCNodeFailoverSupervisor` and implements clean destructor shutdown behavior.
- `src/utils/state.py`: adds advisory file-locking and atomic persistence to `StateRegister`.
- `tests/test_nonce_tracker.py`: adds tests verifying proactive failover on high latency and node connection failures.
- `tests/test_state.py`: tests the state register's basic functions and multiprocess concurrency safety.
- `tests/test_rpc_client.py`: verifies the integration of `FailoverRouter` with the proactive supervisor.

Testing
The test suite was run using the following command:
All 35 tests passed successfully.
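For reviewers who want a picture of the behaviour the failover tests exercise (Issue #510), the following is a minimal sketch and not the code in this PR. `HealthSupervisor`, `get_health_rtt`, `max_latency_s`, and `interval_s` are hypothetical names and parameters chosen for illustration. The latency threshold is an assumption, since the description does not state one, and the probe uses `urllib` from the standard library only to keep the sketch self-contained.

```python
import json
import threading
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def get_health_rtt(url: str, timeout_s: float = 3.5) -> float:
    """Send a getHealth JSON-RPC request and return the round-trip time in seconds."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getHealth"}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    rtt = time.monotonic() - start
    if "error" in body:  # JSON-RPC 2.0 error object
        raise RuntimeError(f"{url} reported an error: {body['error']}")
    return rtt


class HealthSupervisor:
    """Hypothetical stand-in for RPCNodeFailoverSupervisor (illustration only)."""

    def __init__(
        self,
        endpoints: Iterable[str],
        probe: Callable[[str], float] = get_health_rtt,
        max_latency_s: float = 0.5,  # assumed threshold, not taken from the PR
        interval_s: float = 5.0,     # assumed polling interval
    ) -> None:
        self._endpoints = list(endpoints)
        self._probe = probe
        self._max_latency_s = max_latency_s
        self._interval_s = interval_s
        # None means "last probe failed or has not run yet".
        self._rtt: Dict[str, Optional[float]] = {u: None for u in self._endpoints}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def check_once(self) -> None:
        """Probe every endpoint once and record its RTT (or None on failure)."""
        for url in self._endpoints:
            try:
                rtt: Optional[float] = self._probe(url)
            except Exception:
                rtt = None
            with self._lock:
                self._rtt[url] = rtt

    def best_endpoint(self) -> Optional[str]:
        """Return the lowest-latency endpoint under the threshold, if any."""
        with self._lock:
            healthy = {
                u: r for u, r in self._rtt.items()
                if r is not None and r <= self._max_latency_s
            }
        return min(healthy, key=healthy.get) if healthy else None

    def start(self) -> None:
        """Run health checks on a background daemon thread."""
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while not self._stop.is_set():
            self.check_once()
            self._stop.wait(self._interval_s)  # sleep, but wake early on stop()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

A router in this style would call `best_endpoint()` before each submission instead of trying nodes in order, so a slow or unreachable node is skipped without waiting for its request to time out.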
Scope Notes
`fcntl`, `tempfile`, `multiprocessing`, and `requests`).
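As an illustration of the locking and atomic-write pattern described for Issue #453, here is a minimal sketch built on `fcntl`, `tempfile`, and `multiprocessing`. It is not the `StateRegister` code from this PR. `LockedJsonState`, its JSON file format, the `.lock` sidecar file, and the use of `fcntl.flock` (rather than `fcntl.lockf`) are all assumptions made for the example. Like the PR's approach, it only works on Unix-like systems.

```python
import fcntl
import json
import multiprocessing
import os
import tempfile
from typing import Any, Callable, Dict, Optional


class LockedJsonState:
    """Hypothetical stand-in for StateRegister (illustration only)."""

    def __init__(self, path: str, lock: Optional[Any] = None) -> None:
        self._path = path
        self._lock_path = path + ".lock"  # assumed sidecar lock file
        # A multiprocessing.Lock only coordinates processes that share the
        # same lock object, so workers should receive it from the parent.
        # The file lock below also covers processes that do not share it.
        self._mp_lock = lock if lock is not None else multiprocessing.Lock()

    def read(self) -> Dict[str, Any]:
        """Load the current state, or an empty dict if no file exists yet."""
        try:
            with open(self._path, "r", encoding="utf-8") as fh:
                return json.load(fh)
        except FileNotFoundError:
            return {}

    def update(self, key: str, fn: Callable[[Any], Any]) -> Any:
        """Apply fn to the current value of key under both locks."""
        with self._mp_lock:
            with open(self._lock_path, "w") as lock_fh:
                fcntl.flock(lock_fh, fcntl.LOCK_EX)  # advisory, blocks until held
                try:
                    state = self.read()
                    state[key] = fn(state.get(key))
                    self._write_atomic(state)
                    return state[key]
                finally:
                    fcntl.flock(lock_fh, fcntl.LOCK_UN)

    def _write_atomic(self, state: Dict[str, Any]) -> None:
        """Write to a temp file in the same directory, then swap it in."""
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(state, fh)
                fh.flush()
                os.fsync(fh.fileno())
            # os.replace is atomic when source and target share a filesystem,
            # so readers see either the old file or the new one, never a mix.
            os.replace(tmp_path, self._path)
        except BaseException:
            try:
                os.unlink(tmp_path)
            except FileNotFoundError:
                pass
            raise
```

For example, `LockedJsonState("state.json").update("count", lambda v: (v or 0) + 1)` increments a counter. Holding the lock across the whole read-modify-write cycle is what stops two workers from overwriting each other's updates, and the temp-file swap keeps a crash mid-write from leaving a truncated state file.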