feat(resilience): implement proactive rpc failover and process-safe state locking by Vicky08100 · Pull Request #554 · StellarFlow-Network/stellarflow-backend

Vicky08100 · 2026-06-28T16:01:12Z

Closes #453
Closes #510

PR Description

This pull request improves resilience and concurrency safety in the StellarFlow backend through proactive RPC failover (#510) and process-safe locking of shared state (#453).

Proactive RPC Failover Management (Issue #510)

Previously, waiting for an active Soroban RPC node connection to time out during critical ledger submission windows caused transaction pipelines to stall. To address this:

We implemented a proactive monitoring supervisor (RPCNodeFailoverSupervisor) in src/network/nonce_tracker.py that runs a background daemon thread checking RPC node health periodically.
It uses the lightweight getHealth JSON-RPC method to measure round-trip latency.
If the active node's latency exceeds 500 ms or its health check fails, the supervisor shifts the active route to the fastest remaining healthy node.
The FailoverRouter in src/network/rpc_client.py was refactored to obtain the best active endpoint from the supervisor rather than trying endpoints sequentially and waiting out the 3.5 s timeout on each failure.
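
For reviewers, the selection behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/network/nonce_tracker.py: the names SupervisorSketch, http_probe, and best_endpoint, the injectable probe callable, and the 5 s polling interval are all hypothetical, and the sketch uses urllib instead of requests so that it stays standard-library only. Only the getHealth method, the 500 ms threshold, and the 3.5 s timeout come from the description.

```python
import json
import threading
import time
import urllib.request
from typing import Callable, Dict, List, Optional

LATENCY_LIMIT_S = 0.5  # 500 ms threshold from the PR description


def get_health_payload() -> bytes:
    """JSON-RPC 2.0 request body for the getHealth method."""
    return json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getHealth"}).encode()


def http_probe(url: str, timeout_s: float = 3.5) -> float:
    """Hypothetical probe: return the RTT in seconds, or raise on failure."""
    request = urllib.request.Request(
        url, data=get_health_payload(), headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    if body.get("result", {}).get("status") != "healthy":
        raise RuntimeError(f"{url} reported an unhealthy status")
    return time.monotonic() - start


class SupervisorSketch:
    """Illustrative stand-in for RPCNodeFailoverSupervisor (assumed shape)."""

    def __init__(self, endpoints: List[str], probe: Callable[[str], float],
                 interval_s: float = 5.0) -> None:
        self._endpoints = list(endpoints)
        self._probe = probe
        self._interval_s = interval_s  # assumed polling interval
        self._lock = threading.Lock()
        self._active = self._endpoints[0]
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def check_once(self) -> str:
        """Probe every endpoint; switch only if the active one is unhealthy."""
        healthy: Dict[str, float] = {}
        for url in self._endpoints:
            try:
                rtt = self._probe(url)
            except Exception:
                continue  # a failed probe marks the node as unhealthy
            if rtt <= LATENCY_LIMIT_S:
                healthy[url] = rtt
        with self._lock:
            if self._active not in healthy and healthy:
                self._active = min(healthy, key=healthy.get)
            return self._active

    def best_endpoint(self) -> str:
        with self._lock:
            return self._active

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns False on timeout, so this loops once per interval.
        while not self._stop.wait(self._interval_s):
            self.check_once()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Injecting the probe keeps the selection logic testable without a network: a test can pass a fake probe that returns fixed latencies or raises. When no node is healthy, this sketch keeps the current route; whether the real supervisor does the same is not stated in the description.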

Guarding Shared Subprocess State (Issue #453)

When the ingestion engine spawns secondary worker processes to evaluate cryptographic paths, simultaneous writes to local shared state maps risk throwing file collision errors. To solve this:

We refactored StateRegister in src/utils/state.py to use a multiprocessing.Lock along with Unix advisory file-locking via the Python standard fcntl module on Linux.
This dual lock mechanism ensures process safety across parent-child process chains and separate system-level processes.
Every modification to the local state is written to a temporary file, flushed, synced to disk, and then renamed over the target via os.replace, so readers never observe a partially written file.
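
The locking and persistence steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual StateRegister: the class name StateRegisterSketch, the JSON on-disk format, the sidecar ".lock" file, and the update/get method names are hypothetical. The calls themselves (multiprocessing.Lock, fcntl.flock, tempfile.mkstemp, os.fsync, os.replace) are standard library, and fcntl is available on Unix only.

```python
import fcntl
import json
import multiprocessing
import os
import tempfile


class StateRegisterSketch:
    """Illustrative dual-lock register (assumed shape, JSON-backed)."""

    def __init__(self, path, lock=None):
        self._path = path
        self._lock_path = path + ".lock"  # hypothetical sidecar lock file
        # Pass a shared lock in when child processes must use the same one.
        self._mp_lock = multiprocessing.Lock() if lock is None else lock

    def _read(self):
        try:
            with open(self._path, "r", encoding="utf-8") as handle:
                return json.load(handle)
        except FileNotFoundError:
            return {}

    def _write_atomic(self, state):
        directory = os.path.dirname(os.path.abspath(self._path))
        # Same directory as the target so os.replace stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                json.dump(state, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_path, self._path)  # atomic rename over the target
        except BaseException:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise

    def update(self, key, value):
        # Lock 1: serializes processes that share this Lock object.
        with self._mp_lock:
            with open(self._lock_path, "a") as lock_handle:
                # Lock 2: advisory lock that unrelated processes also honor.
                fcntl.flock(lock_handle, fcntl.LOCK_EX)
                try:
                    state = self._read()
                    state[key] = value
                    self._write_atomic(state)
                finally:
                    fcntl.flock(lock_handle, fcntl.LOCK_UN)

    def get(self, key, default=None):
        return self._read().get(key, default)
```

The read-modify-write happens entirely inside both locks, which is what prevents two writers from overwriting each other's updates. The advisory lock only protects against processes that also take it, so every writer has to go through the same code path.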

Changes Made

Modified src/network/nonce_tracker.py to implement RPCNodeFailoverSupervisor and export the default rpc_supervisor instance.
Modified src/network/rpc_client.py to integrate FailoverRouter with RPCNodeFailoverSupervisor and implement clean destructor shutdown behavior.
Modified src/utils/state.py to add advisory file-locking and atomic persistence to StateRegister.
Modified tests/test_nonce_tracker.py to add tests verifying proactive failover on high latency and node connection failures.
Created tests/test_state.py to test the state register basic functions and multiprocess concurrency safety.
Created tests/test_rpc_client.py to verify the integration of FailoverRouter with the proactive supervisor.
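
As a rough picture of the router integration and shutdown behavior listed above, the sketch below shows a router that asks a supervisor for the current endpoint on every call and stops it on close. All names here are hypothetical (FailoverRouterSketch, the send callable, and the best_endpoint/stop methods assumed on the supervisor); the real FailoverRouter in src/network/rpc_client.py may be structured differently.

```python
from typing import Any, Callable


class FailoverRouterSketch:
    """Illustrative router that delegates endpoint choice to a supervisor."""

    def __init__(self, supervisor: Any, send: Callable[[str, dict], Any]) -> None:
        # supervisor: any object with best_endpoint() and stop() (assumed API)
        # send: hypothetical transport callable taking (url, payload)
        self._supervisor = supervisor
        self._send = send
        self._closed = False

    def call(self, payload: dict) -> Any:
        if self._closed:
            raise RuntimeError("router is closed")
        # Ask for the endpoint on every call so a route shift takes effect at once.
        return self._send(self._supervisor.best_endpoint(), payload)

    def close(self) -> None:
        # Idempotent, so explicit close() followed by __del__ stops only once.
        if not self._closed:
            self._closed = True
            self._supervisor.stop()

    def __enter__(self) -> "FailoverRouterSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()

    def __del__(self) -> None:
        # Best-effort cleanup; attributes may be missing if __init__ failed.
        try:
            if not getattr(self, "_closed", True):
                self.close()
        except Exception:
            pass
```

Relying on __del__ alone is fragile because finalization order at interpreter exit is not guaranteed, so the sketch also offers close() and context-manager support as the primary shutdown path.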

Testing

The test suite was run using the following command:

PYTHONPATH=. pytest tests/test_nonce_tracker.py tests/test_state.py tests/test_file_sync.py tests/test_tx_manager.py tests/test_rpc_client.py

All 35 tests passed successfully.

Scope Notes

This is a backend-only change.
No changes to frontend components or smart contracts.
No new external dependencies were introduced (uses the standard-library modules fcntl, tempfile, and multiprocessing, plus requests, which is a third-party package the project already depends on).

drips-wave · 2026-06-28T16:01:22Z

@Vicky08100 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀


feat: implement proactive rpc failover and process-safe state locking

98a605c

Vicky08100 wants to merge 1 commit into StellarFlow-Network:main from Vicky08100:feature/resilience-optimizations
