TQ: Integrate protocol with `NodeTask` #9296

andrewjstone · 2025-10-28T17:45:50Z

This builds on #9258

`NodeTask` now uses the `trust_quorum_protocol::Node` and `trust_quorum_protocol::NodeCtx` to send and receive trust quorum messages. An API to drive this was added to the `NodeTaskHandle`.

The majority of code in this PR is tests using the API.

A follow-up will deal with saving persistent state to a Ledger.


Builds on #9296. This commit persists state to a ledger, following the pattern used in the bootstore. It's done this way because the `PersistentState` itself is contained in the sans-io layer, but we must save it in the async task layer. The sans-io layer shouldn't know how the state is persisted, just that it is, so we recreate the ledger every time we write it. A follow-up PR will deal with the early networking information saved by the bootstore, and will be very similar.

sunshowers · 2025-10-30T21:30:09Z

trust-quorum/src/task.rs

+    ///
+    /// This can block for an indefinite period of time before returning
+    /// and depends on availability of the trust quorum.
+    pub async fn load_rack_secret(


Do we want a `try_load_rack_secret` that only tries once? Also should this log on retry? I'm a little worried this will be a bit of a black box that'll be hard to inspect in case it takes too long.

My intention was that the caller would wrap this in their own timeout, rather than passing one in or retrying. However, it's different enough from everything else that this probably isn't warranted. I actually forgot about it at one point when writing tests and caused a hang. I should probably just not retry internally and return the `Result<Option<ReconstructedRackSecret>>`.

I think that would make sense.
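To illustrate the direction agreed on above (a single-attempt call, with the caller owning the retry and timeout policy), here is a minimal sketch. It is synchronous and uses only the standard library so it runs on its own; the real `load_rack_secret` is async and would be wrapped in the caller's own timeout. `poll_until_ready` and `WaitError` are hypothetical names, not part of this PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error type for the caller-side wait loop.
#[derive(Debug, PartialEq)]
enum WaitError<E> {
    /// The deadline passed before a value became available.
    TimedOut,
    /// The single-attempt operation itself failed.
    Failed(E),
}

/// Call `attempt` until it yields `Ok(Some(_))`, returns an error,
/// or `timeout` elapses. `attempt` stands in for a non-retrying call
/// that returns `Result<Option<T>, E>`.
fn poll_until_ready<T, E>(
    mut attempt: impl FnMut() -> Result<Option<T>, E>,
    interval: Duration,
    timeout: Duration,
) -> Result<T, WaitError<E>> {
    let deadline = Instant::now() + timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        match attempt() {
            Ok(Some(value)) => return Ok(value),
            // Logging each miss keeps the wait from being a black box.
            Ok(None) => eprintln!("attempt {attempts}: not ready yet"),
            Err(e) => return Err(WaitError::Failed(e)),
        }
        if Instant::now() >= deadline {
            return Err(WaitError::TimedOut);
        }
        std::thread::sleep(interval);
    }
}
```

Keeping the loop on the caller's side means each caller can choose its own interval, deadline, and logging, and a test that forgets to bound the wait fails with `TimedOut` instead of hanging.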

sunshowers

Didn't look too closely at the tests, but the code itself looks great! just a few minor comments that I'll trust you to resolve :)

sunshowers · 2025-10-30T21:33:50Z

trust-quorum/src/task.rs

+    /// Return `Ok(true)` if the configuration has committed, `Ok(false)` if
+    /// it hasn't committed yet, or an error otherwise.
+    ///
+    /// Nexus will retry this operation and so we should only try once here.
+    /// This is in contrast to operations like `load_rack_secret` that are
+    /// called directly from sled agent.
+    pub async fn prepare_and_commit(


Could you return a two-valued enum here rather than bool?

What is the "otherwise" in "return an error otherwise" here? Just send and receive errors or something else?

~~Also since this doesn't loop I'd consider calling this `try_prepare_and_commit`.~~ Not relevant if we change `load_rack_secret` to not retry.
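A rough sketch of the two-valued enum suggested above, in place of the `bool` in the return type. `CommitStatus` and its variant names are hypothetical; the PR may settle on different ones.

```rust
/// Hypothetical replacement for the `bool` returned by
/// `prepare_and_commit` and `commit`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CommitStatus {
    /// The configuration has committed.
    Committed,
    /// The configuration has not committed yet; the caller may retry.
    Pending,
}

impl From<bool> for CommitStatus {
    /// Bridge from the existing `bool` convention (`true` = committed).
    fn from(committed: bool) -> Self {
        if committed { CommitStatus::Committed } else { CommitStatus::Pending }
    }
}

/// Example consumer: the match is exhaustive, so adding a third
/// state later becomes a compile error at every call site.
fn describe(status: CommitStatus) -> &'static str {
    match status {
        CommitStatus::Committed => "committed",
        CommitStatus::Pending => "not yet committed",
    }
}
```

Call sites then read `Ok(CommitStatus::Pending)` rather than `Ok(false)`, which removes the need to consult the doc comment to learn what each value means.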

sunshowers · 2025-10-30T21:34:12Z

trust-quorum/src/task.rs

+    /// Nexus will retry this operation and so we should only try once here.
+    /// This is in contrast to operations like `load_rack_secret` that are
+    /// called directly from sled agent.
+    pub async fn commit(


Same comments as above.

sunshowers · 2025-10-30T21:34:45Z

trust-quorum/src/task.rs

        Ok(res)
    }

+    pub async fn status(&self) -> Result<NodeStatus, NodeApiError> {


Add a doc comment here?

sunshowers · 2025-10-30T21:45:45Z

trust-quorum/src/task.rs

+    ///
+    /// This can block for an indefinite period of time before returning
+    /// and depends on availability of the trust quorum.
+    pub async fn load_rack_secret(


I think that would make sense.

sunshowers · 2025-10-30T21:48:41Z

trust-quorum/src/task.rs

+            for envelope in self.ctx.drain_envelopes() {
+                self.conn_mgr.send(envelope).await;
            }


Do we want to do this concurrently, or is serially okay? I guess this shouldn't be cancelled since there's an instruction to make `run` a top-level task.

sunshowers · 2025-10-30T21:49:47Z

trust-quorum/src/task.rs

        }
    }

+    // TODO: Process `ctx`: save persistent state


What's ctx here?

sunshowers · 2025-10-30T21:50:42Z

trust-quorum/src/task.rs

+    /// Return the status of this node if it is a coordinator
+    CoordinatorStatus { responder: oneshot::Sender<Option<CoordinatorStatus>> },
+
+    /// Load a rack secret for the given epoch
+    LoadRackSecret {
+        epoch: Epoch,
+        responder: oneshot::Sender<
+            Result<Option<ReconstructedRackSecret>, LoadRackSecretError>,
+        >,
+    },


Would consider calling all of the oneshot channels `tx` or similar.

sunshowers · 2025-10-31T02:31:17Z

trust-quorum/src/task.rs

+            &poll_interval,
+            &poll_max,


Hmm, honestly this should take a `Duration`, not a reference to it. Worth fixing at some point.
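For context on the point above: `std::time::Duration` is `Copy`, so taking it by value costs nothing and the caller keeps its bindings. The helper below is purely hypothetical (it is not the polling function the diff calls) and only shows the by-value signature.

```rust
use std::time::Duration;

/// Hypothetical helper taking `Duration` by value: how many polls of
/// length `poll_interval` fit in `poll_max`. Returns 0 for a zero
/// interval to avoid dividing by zero.
fn poll_budget(poll_interval: Duration, poll_max: Duration) -> u128 {
    if poll_interval.is_zero() {
        return 0;
    }
    poll_max.as_nanos() / poll_interval.as_nanos()
}
```

A caller can pass `poll_interval` and `poll_max` directly, without `&`, and still use both variables afterwards because the values are copied rather than moved.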

andrewjstone force-pushed the tq-sprockets-2 branch 2 times, most recently from 45450e3 to 4e7f80b on October 28, 2025 17:56

andrewjstone force-pushed the tq-sprockets-2 branch from 4e7f80b to a505cda on October 28, 2025 18:24

andrewjstone mentioned this pull request Oct 29, 2025

TQ: Support persisting state to ledger #9310

Open

andrewjstone mentioned this pull request Oct 29, 2025

Trust Quorum Tracking #8262

Open

19 tasks

andrewjstone requested a review from pietroalbini October 30, 2025 14:59

sunshowers reviewed Oct 30, 2025


sunshowers approved these changes Oct 31, 2025

