feat(resilience): implement proactive rpc failover and process-safe state locking #554

Open
Vicky08100 wants to merge 1 commit into StellarFlow-Network:main from Vicky08100:feature/resilience-optimizations

@Vicky08100

Closes #453
Closes #510

PR Description

This pull request adds resilience and concurrency-safety improvements to the StellarFlow backend.

Proactive RPC Failover Management (Issue #510)

Previously, waiting for an active Soroban RPC node connection to time out during critical ledger submission windows caused transaction pipelines to stall. To address this:

  • We implemented a proactive monitoring supervisor (RPCNodeFailoverSupervisor) in src/network/nonce_tracker.py that runs a background daemon thread checking RPC node health periodically.
  • It uses the lightweight getHealth JSON-RPC method to measure each node's round-trip time (RTT).
  • If a node's latency exceeds 500 ms or its health check fails, the supervisor shifts the active route to the fastest remaining healthy node.
  • The FailoverRouter in src/network/rpc_client.py was refactored to obtain the best active endpoint from the supervisor rather than failing over sequentially and waiting for the 3.5 s timeout.
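
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name FailoverSketch, its methods, and the injectable measure callable are hypothetical, and the real RPCNodeFailoverSupervisor may be structured differently. Only the 500 ms threshold and the background daemon thread are taken from the description. In practice, measure would time a getHealth JSON-RPC call and return None on failure.

```python
import threading
from typing import Callable, Dict, List, Optional

LATENCY_LIMIT_S = 0.5  # 500 ms threshold from the PR description


class FailoverSketch:
    """Hypothetical stand-in for the supervisor; names are illustrative."""

    def __init__(self, endpoints: List[str],
                 measure: Callable[[str], Optional[float]],
                 interval_s: float = 5.0) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._measure = measure        # returns RTT in seconds, or None on failure
        self._interval_s = interval_s  # assumed polling interval
        self._active = self._endpoints[0]
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    @property
    def active(self) -> str:
        with self._lock:
            return self._active

    def check_once(self) -> str:
        """Probe every endpoint and switch away from a slow or failed active node."""
        rtts: Dict[str, Optional[float]] = {}
        for url in self._endpoints:
            try:
                rtts[url] = self._measure(url)
            except Exception:
                rtts[url] = None  # treat any probe error as a failed node
        healthy = {u: r for u, r in rtts.items()
                   if r is not None and r <= LATENCY_LIMIT_S}
        with self._lock:
            # Keep the current node while it is healthy; otherwise move to the
            # fastest healthy node. If none is healthy, leave the route as is.
            if self._active not in healthy and healthy:
                self._active = min(healthy, key=lambda u: healthy[u])
            return self._active

    def _run(self) -> None:
        # Event.wait returns False on timeout, so this loops until stop() is called.
        while not self._stop.wait(self._interval_s):
            self.check_once()

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Injecting measure keeps the selection logic testable without network access, which is one plausible way to write the latency and connection-failure tests listed under Changes Made.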

Guarding Shared Subprocess State (Issue #453)

When the ingestion engine spawns secondary worker processes to evaluate cryptographic paths, simultaneous writes to local shared state maps risk throwing file collision errors. To solve this:

  • We refactored StateRegister in src/utils/state.py to use a multiprocessing.Lock together with Unix advisory file locking via Python's standard fcntl module (available on Linux).
  • This dual lock mechanism ensures process safety across parent-child process chains and separate system-level processes.
  • All modifications to the local state are written atomically to a temporary file, flushed, synced, and renamed via os.replace to prevent file collision errors.
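
As a rough sketch of the file-level half of this scheme (the multiprocessing.Lock layer is omitted), the lock-then-atomic-write sequence could look like the following. The function names, the JSON state format, and the separate .lock file are assumptions made for illustration; the PR does not say whether StateRegister uses fcntl.flock or fcntl.lockf, and this example uses flock. It runs only on Unix-like systems, since fcntl is unavailable on Windows.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def locked(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def atomic_write_json(path, data):
    """Write data to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def update_state(path, key, value):
    """Read-modify-write one key under the advisory lock (hypothetical helper)."""
    with locked(path + ".lock"):
        try:
            with open(path) as f:
                state = json.load(f)
        except FileNotFoundError:
            state = {}
        state[key] = value
        atomic_write_json(path, state)
        return state
```

Creating the temporary file in the target's own directory matters because os.replace is only atomic within a single filesystem. Readers then see either the old file or the new one, never a partial write.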

Changes Made

  • Modified src/network/nonce_tracker.py to implement RPCNodeFailoverSupervisor and export the default rpc_supervisor instance.
  • Modified src/network/rpc_client.py to integrate FailoverRouter with RPCNodeFailoverSupervisor and implement clean destructor shutdown behavior.
  • Modified src/utils/state.py to add advisory file-locking and atomic persistence to StateRegister.
  • Modified tests/test_nonce_tracker.py to add tests verifying proactive failover on high latency and node connection failures.
  • Created tests/test_state.py to test the state register basic functions and multiprocess concurrency safety.
  • Created tests/test_rpc_client.py to verify the integration of FailoverRouter with the proactive supervisor.

Testing

The test suite was run using the following command:

PYTHONPATH=. pytest tests/test_nonce_tracker.py tests/test_state.py tests/test_file_sync.py tests/test_tx_manager.py tests/test_rpc_client.py

All 35 tests passed successfully.

Scope Notes

  • This is a backend-only change.
  • No changes to frontend components or smart contracts.
  • No new external dependencies were introduced (uses the standard-library modules fcntl, tempfile, and multiprocessing, plus the third-party requests package).

@drips-wave

drips-wave Bot commented Jun 28, 2026

@Vicky08100 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits
