Conversation

@mxsm
Owner

@mxsm commented Jan 7, 2026

Which Issue(s) This PR Fixes(Closes)

Fixes #5528

Brief Description

How Did You Test This Change?

Summary by CodeRabbit

  • Chores
    • Implemented a graceful shutdown mechanism for the OpenRaft controller with improved lifecycle management. The server now performs clean shutdown sequences with automatic timeout protection, ensuring proper resource cleanup and better operational stability.

✏️ Tip: You can customize this high-level summary in your review settings.

@rocketmq-rust-bot
Collaborator

🔊@mxsm 🚀Thanks for your contribution🎉!

💡CodeRabbit(AI) will review your code first🔥!

Note

🚨The code review suggestions from CodeRabbit are for reference only; the PR submitter decides whether to act on them. The project maintainers will conduct the final code review💥.

@rocketmq-rust-robot added the feature🚀 (Suggest an idea for this project.) label Jan 7, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 7, 2026

Walkthrough

Implements graceful shutdown and thread-safe state management for the OpenRaft controller. Adds Arc<Mutex<...>>-wrapped lifecycle components (node, handle, shutdown_tx). Startup now initializes a RaftNodeManager and spawns the gRPC server task with a shutdown channel. Shutdown now signals the server, shuts down the Raft node, and awaits completion of the server task with a 10s timeout.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| OpenRaft Controller Lifecycle Management: rocketmq-controller/src/controller/open_raft_controller.rs | Added three Arc<Mutex<...>> fields to the public struct OpenRaftController: node (optional RaftNodeManager), handle (gRPC server JoinHandle), and shutdown_tx (oneshot sender). Expanded the startup flow to create a RaftNodeManager, spawn the gRPC server with serve_with_shutdown and a shutdown channel, and store the lifecycle components. Enhanced the shutdown flow to signal the server via the oneshot channel, shut down the Raft node, await the server task with a 10s timeout, and add comprehensive logging. Added module-level documentation on graceful shutdown and thread-safety patterns. |
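
The field layout summarized above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `LifecycleState` and `NodeManager` are hypothetical names, `std::sync::Mutex` stands in for the `parking_lot::Mutex` used in the PR, and an `mpsc::Sender` and a thread `JoinHandle` stand in for the oneshot sender and the tokio task handle.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread::JoinHandle;

/// Hypothetical stand-in for the PR's `RaftNodeManager`.
pub struct NodeManager {
    pub node_id: u64,
}

/// Hypothetical holder for the three lifecycle components.
/// Each slot is `None` before startup and is emptied again during shutdown.
#[derive(Clone)]
pub struct LifecycleState {
    node: Arc<Mutex<Option<NodeManager>>>,
    handle: Arc<Mutex<Option<JoinHandle<()>>>>,
    shutdown_tx: Arc<Mutex<Option<mpsc::Sender<()>>>>,
}

impl LifecycleState {
    /// All slots start empty, ready to be populated during startup.
    pub fn new() -> Self {
        Self {
            node: Arc::new(Mutex::new(None)),
            handle: Arc::new(Mutex::new(None)),
            shutdown_tx: Arc::new(Mutex::new(None)),
        }
    }

    /// Populate all three slots, as a startup routine would.
    pub fn store(&self, node: NodeManager, handle: JoinHandle<()>, tx: mpsc::Sender<()>) {
        *self.node.lock().unwrap() = Some(node);
        *self.handle.lock().unwrap() = Some(handle);
        *self.shutdown_tx.lock().unwrap() = Some(tx);
    }

    /// `Option::take` moves the value out and leaves `None` behind,
    /// so a repeated shutdown call finds nothing left to do.
    pub fn take_shutdown_tx(&self) -> Option<mpsc::Sender<()>> {
        self.shutdown_tx.lock().unwrap().take()
    }

    pub fn take_handle(&self) -> Option<JoinHandle<()>> {
        self.handle.lock().unwrap().take()
    }

    pub fn take_node(&self) -> Option<NodeManager> {
        self.node.lock().unwrap().take()
    }
}
```

Because every slot is an `Option` behind a shared mutex, clones of the holder observe the same state, and calling a `take_*` method twice yields `Some` once and `None` afterwards.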

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Controller as OpenRaftController
    participant NodeMgr as RaftNodeManager
    participant Server as gRPC Server
    
    autonumber
    
    rect rgb(230, 245, 255)
    Note over Caller,Server: Startup Flow
    Caller->>Controller: start()
    Controller->>NodeMgr: Create RaftNodeManager
    Controller->>Server: Create gRPC service
    Controller->>Server: Spawn serve_with_shutdown task
    Server-->>Controller: Return JoinHandle
    Controller->>Controller: Store node, handle, shutdown_tx<br/>(Arc<Mutex>)
    Controller-->>Caller: Startup complete
    end
    
    rect rgb(255, 240, 240)
    Note over Caller,Server: Shutdown Flow
    Caller->>Controller: stop()
    Controller->>Server: Signal via oneshot channel
    Server-->>Server: Graceful shutdown initiated
    Controller->>NodeMgr: Shutdown Raft node
    NodeMgr-->>Controller: Node shutdown complete
    Controller->>Server: Await JoinHandle (10s timeout)
    Server-->>Controller: Server task finished
    Controller-->>Caller: Shutdown complete
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With Arc and Mutex, threads align,
The shutdown gracefully divine,
State locked safe in rabbit care,
Controller waits with patient flair,
Concurrency without despair! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: implementing graceful shutdown and thread-safe state management for the OpenRaft controller, which aligns with the code changes. |
| Linked Issues check | ✅ Passed | The code changes implement graceful shutdown (with oneshot channel and server task management) and thread-safe state management (using Arc<Mutex<...>> for lifecycle components), directly addressing the objectives from issue #5528. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on the OpenRaft controller's lifecycle management and graceful shutdown mechanisms; no unrelated or out-of-scope modifications are present. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @rocketmq-controller/src/controller/open_raft_controller.rs:
- Around line 146-173: The shutdown sequence currently shuts down the Raft node
via self.node.lock() and node.shutdown().await before awaiting the server task
stored in self.handle, which can abort in-flight gRPC requests; change the order
to first take and signal/await the server task (take self.handle via
self.handle.lock(), await the handle with tokio::time::timeout and log results),
and only after the server task completes (or times out) then take self.node via
self.node.lock() and call node.shutdown().await; ensure you preserve the
existing logging paths (info/eprintln) and handle Option semantics for both the
handle and node.
🧹 Nitpick comments (2)
rocketmq-controller/src/controller/open_raft_controller.rs (2)

61-71: Consider tokio::sync::Mutex for consistency in async context.

Using parking_lot::Mutex is acceptable here since locks are held briefly without crossing await points. However, if future changes require holding locks across async operations, tokio::sync::Mutex would be safer. The current implementation works correctly as-is.
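
The "held briefly" pattern this nitpick relies on can be shown in a synchronous analogue. The sketch below is illustrative only: `shutdown_with_released_lock` is a hypothetical helper, and `std::sync::Mutex` stands in for `parking_lot::Mutex`. The value is taken inside a small block so the guard is dropped before any slow work starts; in async code the slow work would be the `.await`.

```rust
use std::sync::Mutex;

/// Takes the stored value under the lock, releases the lock, then runs `slow`.
/// Returns whether the lock was free while `slow` was about to run;
/// returns `false` when the slot was already empty.
pub fn shutdown_with_released_lock<T>(slot: &Mutex<Option<T>>, slow: impl FnOnce(T)) -> bool {
    // The guard lives only inside this block, so the lock is held
    // just long enough to move the value out.
    let value = {
        let mut guard = slot.lock().unwrap();
        guard.take()
    };

    match value {
        Some(v) => {
            // The guard is gone, so other users of the slot are not blocked.
            let lock_was_free = slot.try_lock().is_ok();
            slow(v);
            lock_was_free
        }
        None => false,
    }
}
```

If the guard were instead kept alive across the slow step, every other user of the slot would block for its full duration, which is the situation the reviewer flags for future changes.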


96-112: Use tracing::error! instead of eprintln! for consistent logging.

The spawned task uses eprintln! for error output (line 108), which bypasses structured logging. This inconsistency makes log aggregation and filtering harder.

♻️ Proposed fix
+use tracing::error;
+
 let handle = tokio::spawn(async move {
     info!("gRPC server starting for node {} on {}", node_id, addr);

     let result = Server::builder()
         .add_service(OpenRaftServiceServer::new(service))
         .serve_with_shutdown(addr, async {
             shutdown_rx.await.ok();
             info!("Shutdown signal received for node {}, stopping gRPC server", node_id);
         })
         .await;

     if let Err(e) = result {
-        eprintln!("gRPC server error for node {}: {}", node_id, e);
+        error!("gRPC server error for node {}: {}", node_id, e);
     } else {
         info!("gRPC server for node {} stopped gracefully", node_id);
     }
 });
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d37420f and 373f91a.

📒 Files selected for processing (1)
  • rocketmq-controller/src/controller/open_raft_controller.rs
🧰 Additional context used
🧬 Code graph analysis (1)
rocketmq-controller/src/controller/open_raft_controller.rs (2)
rocketmq-controller/src/controller.rs (4)
  • startup (183-183)
  • startup (523-525)
  • shutdown (200-200)
  • shutdown (527-529)
rocketmq-controller/src/controller/raft_controller.rs (2)
  • startup (62-67)
  • shutdown (69-74)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build & Test (macos-latest)
  • GitHub Check: Build & Test (ubuntu-latest)
  • GitHub Check: Code Coverage
  • GitHub Check: auto-approve
🔇 Additional comments (3)
rocketmq-controller/src/controller/open_raft_controller.rs (3)

15-21: LGTM!

Clear module-level documentation explaining the key features and design goals.


73-82: LGTM!

Constructor properly initializes all lifecycle fields to None, ready for population during startup().


179-259: Stub methods acknowledged.

These methods are appropriately marked with TODO comments. The PR scope is focused on graceful shutdown and thread-safe state management, which is correctly implemented in startup() and shutdown().

Comment on lines +146 to +173
// Shutdown Raft node
let node = {
    let mut node_guard = self.node.lock();
    node_guard.take()
};

if let Some(node) = node {
    if let Err(e) = node.shutdown().await {
        eprintln!("Error shutting down Raft node: {}", e);
    } else {
        info!("Raft node shutdown successfully");
    }
}

// Wait for server task to complete (with timeout)
let handle = {
    let mut handle_guard = self.handle.lock();
    handle_guard.take()
};

if let Some(handle) = handle {
    let timeout = tokio::time::Duration::from_secs(10);
    match tokio::time::timeout(timeout, handle).await {
        Ok(Ok(_)) => info!("Server task completed successfully"),
        Ok(Err(e)) => eprintln!("Server task panicked: {}", e),
        Err(_) => eprintln!("Timeout waiting for server task to complete"),
    }
}
Contributor

⚠️ Potential issue | 🟠 Major

Shutdown ordering may cause in-flight requests to fail.

The current order shuts down the Raft node (lines 152-158) before waiting for the gRPC server task to complete (lines 166-173). If the server is still processing in-flight requests when the Raft node shuts down, those requests will fail.

Consider reordering: signal server → wait for server to finish → then shutdown Raft node.

🔧 Proposed fix
     // Take and send shutdown signal to gRPC server
     {
         let mut shutdown_tx_guard = self.shutdown_tx.lock();
         if let Some(tx) = shutdown_tx_guard.take() {
             if tx.send(()).is_err() {
-                eprintln!("Failed to send shutdown signal to gRPC server (receiver dropped)");
+                error!("Failed to send shutdown signal to gRPC server (receiver dropped)");
             } else {
                 info!("Shutdown signal sent to gRPC server");
             }
         }
     }

-    // Shutdown Raft node
-    let node = {
-        let mut node_guard = self.node.lock();
-        node_guard.take()
-    };
-
-    if let Some(node) = node {
-        if let Err(e) = node.shutdown().await {
-            eprintln!("Error shutting down Raft node: {}", e);
-        } else {
-            info!("Raft node shutdown successfully");
-        }
-    }
-
     // Wait for server task to complete (with timeout)
     let handle = {
         let mut handle_guard = self.handle.lock();
         handle_guard.take()
     };

     if let Some(handle) = handle {
         let timeout = tokio::time::Duration::from_secs(10);
         match tokio::time::timeout(timeout, handle).await {
             Ok(Ok(_)) => info!("Server task completed successfully"),
-            Ok(Err(e)) => eprintln!("Server task panicked: {}", e),
-            Err(_) => eprintln!("Timeout waiting for server task to complete"),
+            Ok(Err(e)) => error!("Server task panicked: {}", e),
+            Err(_) => warn!("Timeout waiting for server task to complete"),
         }
     }

+    // Shutdown Raft node after server has stopped accepting requests
+    let node = {
+        let mut node_guard = self.node.lock();
+        node_guard.take()
+    };
+
+    if let Some(node) = node {
+        if let Err(e) = node.shutdown().await {
+            error!("Error shutting down Raft node: {}", e);
+        } else {
+            info!("Raft node shutdown successfully");
+        }
+    }
+
     info!("OpenRaft controller shutdown completed");
     Ok(())

Committable suggestion skipped: line range outside the PR's diff.
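
The recommended ordering (signal, wait with a timeout, then shut down the node) can be exercised in isolation. The following is a sketch under stated assumptions rather than the controller's code: `ordered_shutdown` is a hypothetical function, a plain thread stands in for the gRPC server task, `mpsc` channels stand in for the oneshot channel, and the Raft node shutdown is reduced to a log entry.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs the three shutdown steps in the recommended order and returns
/// a log of what happened.
pub fn ordered_shutdown(timeout: Duration) -> Vec<&'static str> {
    let mut log = Vec::new();

    // Stand-in for the server task: it runs until the shutdown signal
    // arrives, then reports completion on `done_tx`.
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();
    let (done_tx, done_rx) = mpsc::channel::<()>();
    let server = thread::spawn(move || {
        let _ = shutdown_rx.recv(); // block until signalled or sender dropped
        let _ = done_tx.send(()); // in-flight work would be drained before this
    });

    // 1. Signal the server.
    if shutdown_tx.send(()).is_ok() {
        log.push("signal sent");
    }

    // 2. Wait for the server to finish, bounded by the timeout.
    match done_rx.recv_timeout(timeout) {
        Ok(()) => {
            let _ = server.join();
            log.push("server stopped");
        }
        Err(_) => log.push("server wait timed out"),
    }

    // 3. Only now shut down the backing node (a placeholder here),
    //    so no request handler can observe a stopped node.
    log.push("node shut down");
    log
}
```

Calling `ordered_shutdown(Duration::from_secs(10))` yields the log entries in the order signal, server, node. If the wait times out, the node is still shut down afterwards, which mirrors the "completes (or times out)" wording of the suggestion.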


@codecov

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 82.81250% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.44%. Comparing base (d37420f) to head (373f91a).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...-controller/src/controller/open_raft_controller.rs | 82.81% | 11 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5529      +/-   ##
==========================================
+ Coverage   38.41%   38.44%   +0.03%     
==========================================
  Files         815      815              
  Lines      110512   110567      +55     
==========================================
+ Hits        42452    42508      +56     
+ Misses      68060    68059       -1     
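
The percentages in this report follow from the hit and line counts, which a short calculation can confirm. In the sketch below, `coverage_pct` is a hypothetical helper. The patch size of 64 lines is an inference, not a figure stated in the report: it is the line count for which 11 missing lines give exactly 82.8125%.

```rust
/// Coverage as a percentage of `hits` over `lines`.
pub fn coverage_pct(hits: u64, lines: u64) -> f64 {
    hits as f64 * 100.0 / lines as f64
}

/// Base and head project coverage, computed from the diff table above.
pub fn project_coverage() -> (f64, f64) {
    (coverage_pct(42452, 110512), coverage_pct(42508, 110567))
}

/// Patch coverage under the inferred assumption of 64 patch lines,
/// 11 of them missing (53 covered).
pub fn inferred_patch_coverage() -> f64 {
    coverage_pct(64 - 11, 64)
}
```

Under that assumption the patch figure comes out at exactly 82.8125, and the two project figures land within 0.01 of the reported 38.41% and 38.44%.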

Collaborator

@rocketmq-rust-bot left a comment

LGTM - All CI checks passed ✅

@rocketmq-rust-bot merged commit 75c1e96 into main Jan 7, 2026
20 checks passed
@rocketmq-rust-bot added the approved (PR has approved) label and removed the ready to review and waiting-review (waiting review this PR) labels Jan 7, 2026

Labels

  • AI review first: Ai review pr first
  • approved: PR has approved
  • auto merge
  • feature🚀: Suggest an idea for this project.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature🚀] Implement graceful shutdown and thread-safe state management for OpenRaft controller

4 participants