Skip to content

Conversation

@andrewjstone
Copy link
Contributor

@andrewjstone andrewjstone commented Oct 28, 2025

This builds on #9258

NodeTask now uses the trust_quorum_protocol::Node and trust_quorum_protocol::NodeCtx to send and receive trust quorum messages. An API to drive this was added to the NodeTaskHandle.

The majority of code in this PR is tests using the API.

A follow up will deal with saving persistent state to a Ledger.

@andrewjstone andrewjstone force-pushed the tq-sprockets-2 branch 2 times, most recently from 45450e3 to 4e7f80b Compare October 28, 2025 17:56
`NodeTask` now uses the `trust_quorum_protocol::Node` and
`trust_quorum_protocol::NodeCtx` to send and receive trust quorum
messages. An API to drive this was added to the `NodeTaskHandle`.

The majority of code in this PR is tests using the API.

A follow up will deal with saving persistent state to a Ledger.
andrewjstone added a commit that referenced this pull request Oct 29, 2025
Builds on #9296

This commit persists state to a ledger, following the pattern used in
the bootstore. It's done this way because the `PersistentState` itself
is contained in the sans-io layer, but we must save it in the async task
layer. The sans-io layer shouldn't know how the state is persisted, just
that it  is, and so we recreate the ledger for every time we write it.

A follow up will PR will deal with the early networking information saved
by the bootstore, and will be very similar.
andrewjstone added a commit that referenced this pull request Oct 29, 2025
Builds on #9296

This commit persists state to a ledger, following the pattern used in
the bootstore. It's done this way because the `PersistentState` itself
is contained in the sans-io layer, but we must save it in the async task
layer. The sans-io layer shouldn't know how the state is persisted, just
that it  is, and so we recreate the ledger for every time we write it.

A follow up will PR will deal with the early networking information saved
by the bootstore, and will be very similar.
@andrewjstone andrewjstone mentioned this pull request Oct 29, 2025
19 tasks
///
/// This can block for an indefinite period of time before returning
/// and depends on availability of the trust quorum.
pub async fn load_rack_secret(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want a try_load_rack_secret that only tries once? Also should this log on retry? I'm a little worried this will be a bit of a black box that'll be hard to inspect in case it takes too long.

Copy link
Contributor Author

@andrewjstone andrewjstone Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was that the caller would wrap this in their own timeout, rather than passing one in or retrying. However, it's different enough from everything else that this probably isn't warranted. I actually forgot about it at one point when writing tests and caused a hang. I should probably just not retry internally and return the Result<Option<ReconstructedRackSecret>>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would make sense.

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look too closely at the tests, but the code itself looks great! just a few minor comments that I'll trust you to resolve :)

Comment on lines +264 to +270
/// Return `Ok(true)` if the configuration has committed, `Ok(false)` if
/// it hasn't committed yet, or an error otherwise.
///
/// Nexus will retry this operation and so we should only try once here.
/// This is in contrast to operations like `load_rack_secret` that are
/// called directly from sled agent.
pub async fn prepare_and_commit(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you return a two-valued enum here rather than bool?

What is the "otherwise" in "return an error otherwise" here? Just send and receive errors or something else?

Also since this doesn't loop I'd consider calling this try_prepare_and_commit. not relevant if we change load_rack_secret to not retry.

/// Nexus will retry this operation and so we should only try once here.
/// This is in contrast to operations like `load_rack_secret` that are
/// called directly from sled agent.
pub async fn commit(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as above.

Ok(res)
}

pub async fn status(&self) -> Result<NodeStatus, NodeApiError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a doc comment here?

///
/// This can block for an indefinite period of time before returning
/// and depends on availability of the trust quorum.
pub async fn load_rack_secret(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would make sense.

Comment on lines +419 to 421
for envelope in self.ctx.drain_envelopes() {
self.conn_mgr.send(envelope).await;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to do this concurrently, or is serially okay? I guess this shouldn't be cancelled since there's an instruction to make run a top-level task.

}
}

// TODO: Process `ctx`: save persistent state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's ctx here?

Comment on lines +108 to +117
/// Return the status of this node if it is a coordinator
CoordinatorStatus { responder: oneshot::Sender<Option<CoordinatorStatus>> },

/// Load a rack secret for the given epoch
LoadRackSecret {
epoch: Epoch,
responder: oneshot::Sender<
Result<Option<ReconstructedRackSecret>, LoadRackSecretError>,
>,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would consider calling all of the oneshot channels tx or similar

Comment on lines +936 to +937
&poll_interval,
&poll_max,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, honestly this should take a Duration, not a reference to it. Worth fixing at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants