[ISSUE #5528]🚀Implement graceful shutdown and thread-safe state management for OpenRaft controller #5529
Walkthrough

Implemented graceful shutdown and thread-safe state management for the OpenRaft controller. Added Arc-wrapped lifecycle components (node, handle, shutdown_tx). Expanded startup to initialize RaftNodeManager and spawn the gRPC server task with a shutdown channel. Enhanced shutdown to signal the server, shut down the Raft node, and await task completion with a timeout.
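The "Arc-wrapped lifecycle components" pattern above can be sketched with std-only types (the real controller uses tokio tasks and `parking_lot`; `Lifecycle`, `start`, and `stop` here are illustrative names, not the PR's actual API):

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

// Hypothetical stand-in for the controller's lifecycle state: each field is
// an Arc<Mutex<Option<...>>> so it can be shared across clones and consumed
// exactly once during shutdown via Option::take().
struct Lifecycle {
    handle: Arc<Mutex<Option<JoinHandle<()>>>>,
}

impl Lifecycle {
    fn new() -> Self {
        Lifecycle {
            handle: Arc::new(Mutex::new(None)),
        }
    }

    fn start(&self) {
        let handle = thread::spawn(|| {
            // the real controller would run the gRPC server task here
        });
        *self.handle.lock().unwrap() = Some(handle);
    }

    fn stop(&self) {
        // take() moves the handle out under a brief lock, so a second
        // stop() finds None and is a harmless no-op.
        let handle = self.handle.lock().unwrap().take();
        if let Some(h) = handle {
            h.join().expect("server task panicked");
        }
    }
}
```

Because `stop()` takes the `Option` rather than cloning it, repeated shutdown calls are idempotent, which is the thread-safety property the walkthrough describes.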
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Controller as OpenRaftController
    participant NodeMgr as RaftNodeManager
    participant Server as gRPC Server
    autonumber
    rect rgb(230, 245, 255)
        Note over Caller,Server: Startup Flow
        Caller->>Controller: start()
        Controller->>NodeMgr: Create RaftNodeManager
        Controller->>Server: Create gRPC service
        Controller->>Server: Spawn serve_with_shutdown task
        Server-->>Controller: Return JoinHandle
        Controller->>Controller: Store node, handle, shutdown_tx<br/>(Arc<Mutex>)
        Controller-->>Caller: Startup complete
    end
    rect rgb(255, 240, 240)
        Note over Caller,Server: Shutdown Flow
        Caller->>Controller: stop()
        Controller->>Server: Signal via oneshot channel
        Server-->>Server: Graceful shutdown initiated
        Controller->>NodeMgr: Shutdown Raft node
        NodeMgr-->>Controller: Node shutdown complete
        Controller->>Server: Await JoinHandle (10s timeout)
        Server-->>Controller: Server task finished
        Controller-->>Caller: Shutdown complete
    end
```
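The shutdown flow in the diagram (signal via channel, then a bounded wait for the server task) can be sketched with std channels and threads. This is a minimal sketch, not the PR's code: the real controller uses `tokio::sync::oneshot` and awaits the `JoinHandle` under `tokio::time::timeout`; `spawn_server` and `shutdown_gracefully` are illustrative names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Spawn a "server" that blocks until the shutdown signal arrives, then
// reports completion. Returns (shutdown sender, completion receiver).
fn spawn_server() -> (mpsc::Sender<()>, mpsc::Receiver<()>) {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        // Wait for the shutdown signal (or for the sender to be dropped).
        let _ = shutdown_rx.recv();
        // ... a real server would drain in-flight work here ...
        let _ = done_tx.send(());
    });
    (shutdown_tx, done_rx)
}

/// Signal shutdown, then wait up to `timeout` for the server to finish.
/// Returns true if the server stopped within the deadline.
fn shutdown_gracefully(
    shutdown_tx: mpsc::Sender<()>,
    done_rx: mpsc::Receiver<()>,
    timeout: Duration,
) -> bool {
    let _ = shutdown_tx.send(()); // 1. signal
    done_rx.recv_timeout(timeout).is_ok() // 2. bounded wait
}
```

The bounded wait mirrors the diagram's "Await JoinHandle (10s timeout)" step: shutdown never hangs forever on a stuck server task.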
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @rocketmq-controller/src/controller/open_raft_controller.rs:
- Around line 146-173: The shutdown sequence currently shuts down the Raft node
via self.node.lock() and node.shutdown().await before awaiting the server task
stored in self.handle, which can abort in-flight gRPC requests; change the order
to first take and signal/await the server task (take self.handle via
self.handle.lock(), await the handle with tokio::time::timeout and log results),
and only after the server task completes (or times out) then take self.node via
self.node.lock() and call node.shutdown().await; ensure you preserve the
existing logging paths (info/eprintln) and handle Option semantics for both the
handle and node.
🧹 Nitpick comments (2)
rocketmq-controller/src/controller/open_raft_controller.rs (2)
61-71: Consider `tokio::sync::Mutex` for consistency in async context.

Using `parking_lot::Mutex` is acceptable here since locks are held briefly without crossing await points. However, if future changes require holding locks across async operations, `tokio::sync::Mutex` would be safer. The current implementation works correctly as-is.
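The "locks are held briefly without crossing await points" discipline comes down to scoping the guard so it is dropped before any long-running work. A minimal std-only sketch (the controller uses `parking_lot::Mutex`, which has the same guard-scoping behavior; `take_node` and the `String` payload are illustrative):

```rust
use std::sync::{Arc, Mutex};

// Take the value out of the shared slot under a brief lock. The guard lives
// only inside the inner block, so it is released before the caller does any
// slow (or, in async code, awaited) work on the extracted value.
fn take_node(slot: &Arc<Mutex<Option<String>>>) -> Option<String> {
    let node = {
        let mut guard = slot.lock().unwrap();
        guard.take()
    }; // guard dropped here
    node
}
```

This is exactly why a synchronous mutex is safe in the PR's `shutdown()`: the lock is never held across an `.await`, only long enough to `take()` the value.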
96-112: Use `tracing::error!` instead of `eprintln!` for consistent logging.

The spawned task uses `eprintln!` for error output (line 108), which bypasses structured logging. This inconsistency makes log aggregation and filtering harder.

♻️ Proposed fix

```diff
+use tracing::error;
+
 let handle = tokio::spawn(async move {
     info!("gRPC server starting for node {} on {}", node_id, addr);
     let result = Server::builder()
         .add_service(OpenRaftServiceServer::new(service))
         .serve_with_shutdown(addr, async {
             shutdown_rx.await.ok();
             info!("Shutdown signal received for node {}, stopping gRPC server", node_id);
         })
         .await;
     if let Err(e) = result {
-        eprintln!("gRPC server error for node {}: {}", node_id, e);
+        error!("gRPC server error for node {}: {}", node_id, e);
     } else {
         info!("gRPC server for node {} stopped gracefully", node_id);
     }
 });
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
rocketmq-controller/src/controller/open_raft_controller.rs
🧰 Additional context used
🧬 Code graph analysis (1)
rocketmq-controller/src/controller/open_raft_controller.rs (2)
rocketmq-controller/src/controller.rs (4)
`startup` (183-183), `startup` (523-525), `shutdown` (200-200), `shutdown` (527-529)
rocketmq-controller/src/controller/raft_controller.rs (2)
`startup` (62-67), `shutdown` (69-74)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build & Test (macos-latest)
- GitHub Check: Build & Test (ubuntu-latest)
- GitHub Check: Code Coverage
- GitHub Check: auto-approve
🔇 Additional comments (3)
rocketmq-controller/src/controller/open_raft_controller.rs (3)
15-21: LGTM! Clear module-level documentation explaining the key features and design goals.
73-82: LGTM! Constructor properly initializes all lifecycle fields to `None`, ready for population during `startup()`.
179-259: Stub methods acknowledged. These methods are appropriately marked with TODO comments. The PR scope is focused on graceful shutdown and thread-safe state management, which is correctly implemented in `startup()` and `shutdown()`.
```rust
// Shutdown Raft node
let node = {
    let mut node_guard = self.node.lock();
    node_guard.take()
};

if let Some(node) = node {
    if let Err(e) = node.shutdown().await {
        eprintln!("Error shutting down Raft node: {}", e);
    } else {
        info!("Raft node shutdown successfully");
    }
}

// Wait for server task to complete (with timeout)
let handle = {
    let mut handle_guard = self.handle.lock();
    handle_guard.take()
};

if let Some(handle) = handle {
    let timeout = tokio::time::Duration::from_secs(10);
    match tokio::time::timeout(timeout, handle).await {
        Ok(Ok(_)) => info!("Server task completed successfully"),
        Ok(Err(e)) => eprintln!("Server task panicked: {}", e),
        Err(_) => eprintln!("Timeout waiting for server task to complete"),
    }
}
```
Shutdown ordering may cause in-flight requests to fail.
The current order shuts down the Raft node (lines 152-158) before waiting for the gRPC server task to complete (lines 166-173). If the server is still processing in-flight requests when the Raft node shuts down, those requests will fail.
Consider reordering: signal server → wait for server to finish → then shutdown Raft node.
🔧 Proposed fix

```diff
 // Take and send shutdown signal to gRPC server
 {
     let mut shutdown_tx_guard = self.shutdown_tx.lock();
     if let Some(tx) = shutdown_tx_guard.take() {
         if tx.send(()).is_err() {
-            eprintln!("Failed to send shutdown signal to gRPC server (receiver dropped)");
+            error!("Failed to send shutdown signal to gRPC server (receiver dropped)");
         } else {
             info!("Shutdown signal sent to gRPC server");
         }
     }
 }

-// Shutdown Raft node
-let node = {
-    let mut node_guard = self.node.lock();
-    node_guard.take()
-};
-
-if let Some(node) = node {
-    if let Err(e) = node.shutdown().await {
-        eprintln!("Error shutting down Raft node: {}", e);
-    } else {
-        info!("Raft node shutdown successfully");
-    }
-}
-
 // Wait for server task to complete (with timeout)
 let handle = {
     let mut handle_guard = self.handle.lock();
     handle_guard.take()
 };

 if let Some(handle) = handle {
     let timeout = tokio::time::Duration::from_secs(10);
     match tokio::time::timeout(timeout, handle).await {
         Ok(Ok(_)) => info!("Server task completed successfully"),
-        Ok(Err(e)) => eprintln!("Server task panicked: {}", e),
-        Err(_) => eprintln!("Timeout waiting for server task to complete"),
+        Ok(Err(e)) => error!("Server task panicked: {}", e),
+        Err(_) => warn!("Timeout waiting for server task to complete"),
     }
 }

+// Shutdown Raft node after server has stopped accepting requests
+let node = {
+    let mut node_guard = self.node.lock();
+    node_guard.take()
+};
+
+if let Some(node) = node {
+    if let Err(e) = node.shutdown().await {
+        error!("Error shutting down Raft node: {}", e);
+    } else {
+        info!("Raft node shutdown successfully");
+    }
+}
+
 info!("OpenRaft controller shutdown completed");
 Ok(())
```

Committable suggestion skipped: line range outside the PR's diff.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff            @@
##             main    #5529      +/-   ##
==========================================
+ Coverage   38.41%   38.44%    +0.03%
==========================================
  Files         815      815
  Lines      110512   110567       +55
==========================================
+ Hits        42452    42508       +56
+ Misses      68060    68059        -1
```

☔ View full report in Codecov by Sentry.
rocketmq-rust-bot left a comment:
LGTM - All CI checks passed ✅
Which Issue(s) This PR Fixes (Closes)
Fixes #5528
Brief Description
How Did You Test This Change?
Summary by CodeRabbit