refactor: use actor model for rust-libp2p bridge by zclawz · Pull Request #670 · blockblaz/zeam

zclawz · 2026-03-13T08:06:43Z

Closes #668

Summary

Fixes the data race on the libp2p Swarm object that caused the node to stall and stop receiving gossip messages while connections remained alive.

Root cause: Five FFI functions called from the Zig thread accessed static mut SWARM_STATE via get_swarm_mut() without synchronization, while the Tokio event loop on a separate thread held a mutable reference to the same swarm. This corrupted internal gossipsub state.

Fix: Replace direct swarm mutation from FFI functions with a tokio::sync::mpsc command channel per network. The Tokio event loop is now the only place that mutates the Swarm.

Changes

Add SwarmCommand enum with variants for all FFI-driven swarm operations: Publish, SendRpcRequest, SendRpcResponseChunk, SendRpcEndOfStream, SendRpcErrorResponse
Add COMMAND_SENDERS: Mutex<HashMap<u32, UnboundedSender<SwarmCommand>>> (per network_id) registered in start_network
Add send_swarm_command() helper used by all FFI functions
Store UnboundedReceiver<SwarmCommand> in Network.cmd_rx field
run_eventloop now selects on cmd_rx.recv() and executes commands, keeping all swarm mutations on the Tokio thread
FFI functions (publish_msg_to_rust_bridge, send_rpc_request, send_rpc_response_chunk, send_rpc_end_of_stream, send_rpc_error_response) now push commands instead of calling get_swarm_mut()
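
For readers who have not seen the pattern, here is a minimal sketch of the command-channel design described in the list above. It is not the code in this PR: it uses std::sync::mpsc and a plain thread in place of tokio::sync::mpsc and the Tokio event loop so that it runs without dependencies. FakeSwarm, the Shutdown variant, the Publish payload fields, and the command_senders() accessor are hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Mutex, OnceLock};
use std::thread::{self, JoinHandle};

/// Simplified stand-in for the PR's SwarmCommand (assumed payloads).
#[derive(Debug)]
enum SwarmCommand {
    Publish { topic: String, data: Vec<u8> },
    /// Hypothetical variant, present only so the sketch can terminate.
    Shutdown,
}

/// Stand-in for the libp2p Swarm: state that only the event loop may mutate.
#[derive(Default)]
struct FakeSwarm {
    published: Vec<(String, Vec<u8>)>,
}

/// Per-network registry of command senders, keyed by network_id.
fn command_senders() -> &'static Mutex<HashMap<u32, Sender<SwarmCommand>>> {
    static SENDERS: OnceLock<Mutex<HashMap<u32, Sender<SwarmCommand>>>> = OnceLock::new();
    SENDERS.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Called from the FFI side. Returns false if the network is not registered
/// or its event loop has already exited.
fn send_swarm_command(network_id: u32, cmd: SwarmCommand) -> bool {
    let senders = command_senders().lock().unwrap();
    match senders.get(&network_id) {
        Some(tx) => tx.send(cmd).is_ok(),
        None => false,
    }
}

/// Registers the sender, then moves the receiver and the swarm into the
/// event-loop thread, which becomes the only place the swarm is mutated.
fn start_network(network_id: u32) -> JoinHandle<FakeSwarm> {
    let (tx, rx) = channel::<SwarmCommand>();
    command_senders().lock().unwrap().insert(network_id, tx);
    thread::spawn(move || {
        let mut swarm = FakeSwarm::default();
        for cmd in rx {
            match cmd {
                SwarmCommand::Publish { topic, data } => swarm.published.push((topic, data)),
                SwarmCommand::Shutdown => break,
            }
        }
        swarm
    })
}
```

Callers never hold a reference to the swarm; they only enqueue a command and learn whether it was accepted. In the real bridge the receive side would sit inside tokio::select! alongside swarm event polling rather than in a dedicated blocking loop.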

Testing

cargo test passes (all 4 Rust unit tests)
zig build test shows only pre-existing io_uring sandbox failures, which are unrelated to this change

Replace direct static-mut Swarm access from FFI functions with a tokio::sync::mpsc command channel. The Tokio event loop is now the only place that mutates the Swarm, eliminating the data race between FFI callers on the Zig thread and the Tokio event loop thread.

Changes:

- Add SwarmCommand enum with variants for all FFI-driven swarm ops: Publish, SendRpcRequest, SendRpcResponseChunk, SendRpcEndOfStream, SendRpcErrorResponse
- Add COMMAND_SENDERS: Mutex<HashMap<u32, UnboundedSender<SwarmCommand>>> (per network_id) registered in start_network
- Add send_swarm_command() helper used by all FFI functions
- Store UnboundedReceiver<SwarmCommand> in Network.cmd_rx field
- run_eventloop now selects on cmd_rx.recv() and executes commands, keeping all swarm mutations on the Tokio thread
- FFI functions (publish_msg_to_rust_bridge, send_rpc_request, send_rpc_response_chunk, send_rpc_end_of_stream, send_rpc_error_response) now push commands instead of calling get_swarm_mut()
- Update test to use #[tokio::test] and test invalid peer_id → 0 path

All Rust tests pass. The SWARM_STATE static remains but is now only accessed from run_eventloop (single thread).

GrapeBaBa · 2026-03-13T08:20:46Z

@codex review

Copilot

Pull request overview

This PR refactors the Rust libp2p FFI bridge to an actor-model design, ensuring the Tokio event loop is the sole mutator of the libp2p Swarm to eliminate a cross-thread data race that could stall gossip.

Changes:

Introduces SwarmCommand plus a per-network_id command channel (COMMAND_SENDERS) and send_swarm_command() helper.
Updates FFI entrypoints (publish + req/resp send APIs) to enqueue commands instead of directly mutating the swarm.
Extends Network to own a command receiver and processes commands inside run_eventloop via tokio::select!.


GrapeBaBa · 2026-03-13T08:58:24Z

rust/libp2p-glue/src/lib.rs

    let request_id = REQUEST_ID_COUNTER.fetch_add(1, Ordering::Relaxed) + 1;
-
    let request_message = RequestMessage::new(protocol_id.clone(), request_bytes);

-    swarm
-        .behaviour_mut()
-        .reqresp
-        .send_request(peer_id, request_id, request_message);
-
+    // Register tracking state before sending the command so the event loop handler
+    // sees the entries if the response arrives quickly.
    REQUEST_ID_MAP.lock().unwrap().insert(request_id, ());
-    REQUEST_PROTOCOL_MAP
-        .lock()
-        .unwrap()
-        .insert(request_id, protocol_id.clone());
+    REQUEST_PROTOCOL_MAP.lock().unwrap().insert(request_id, protocol_id.clone());
+
+    send_swarm_command(network_id, SwarmCommand::SendRpcRequest {
+        peer_id,
+        request_id,
+        protocol_id,
+        request_message,
+    });


@zclawz this is a correct concern

Fixed in commits c33b450 + b3c2e14. send_rpc_request now pre-inserts tracking state (as before) but rolls back both REQUEST_ID_MAP and REQUEST_PROTOCOL_MAP entries if send_swarm_command returns false.
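
The rollback described in this reply could look roughly like the sketch below. It is not the code from those commits. The Tracking struct, the enqueue closure (standing in for send_swarm_command with a SendRpcRequest command), the u64 request-id type, and the choice to return 0 on failure are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

static REQUEST_ID_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Hypothetical container for the two tracking maps named in the thread.
#[derive(Default)]
struct Tracking {
    request_ids: Mutex<HashMap<u64, ()>>,
    request_protocols: Mutex<HashMap<u64, String>>,
}

/// Pre-registers tracking state, tries to enqueue the command, and undoes
/// the registration if the enqueue fails. Returns the request id, or 0 on
/// failure (assumed convention).
fn send_rpc_request_sketch(
    tracking: &Tracking,
    protocol_id: &str,
    enqueue: impl FnOnce(u64) -> bool,
) -> u64 {
    let request_id = REQUEST_ID_COUNTER.fetch_add(1, Ordering::Relaxed) + 1;

    // Insert first so the event loop sees the entries even if the response
    // arrives right after the command is processed.
    tracking.request_ids.lock().unwrap().insert(request_id, ());
    tracking
        .request_protocols
        .lock()
        .unwrap()
        .insert(request_id, protocol_id.to_string());

    if !enqueue(request_id) {
        // The command never reached the event loop: remove both entries so
        // they do not leak.
        tracking.request_ids.lock().unwrap().remove(&request_id);
        tracking.request_protocols.lock().unwrap().remove(&request_id);
        return 0;
    }
    request_id
}
```

Inserting before enqueueing keeps the ordering argued for in the diff comment, and the rollback covers the case where the channel is closed or the network was never registered.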