- 
                Notifications
    You must be signed in to change notification settings 
- Fork 60
          TQ: Integrate protocol with NodeTask
          #9296
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: tq-sprockets
Are you sure you want to change the base?
Conversation
45450e3    to
    4e7f80b      
    Compare
  
    `NodeTask` now uses the `trust_quorum_protocol::Node` and `trust_quorum_protocol::NodeCtx` to send and receive trust quorum messages. An API to drive this was added to the `NodeTaskHandle`. The majority of code in this PR is tests using the API. A follow up will deal with saving persistent state to a Ledger.
4e7f80b    to
    a505cda      
    Compare
  
    Builds on #9296 This commit persists state to a ledger, following the pattern used in the bootstore. It's done this way because the `PersistentState` itself is contained in the sans-io layer, but we must save it in the async task layer. The sans-io layer shouldn't know how the state is persisted, just that it is, and so we recreate the ledger for every time we write it. A follow up will PR will deal with the early networking information saved by the bootstore, and will be very similar.
Builds on #9296 This commit persists state to a ledger, following the pattern used in the bootstore. It's done this way because the `PersistentState` itself is contained in the sans-io layer, but we must save it in the async task layer. The sans-io layer shouldn't know how the state is persisted, just that it is, and so we recreate the ledger for every time we write it. A follow up will PR will deal with the early networking information saved by the bootstore, and will be very similar.
| /// | ||
| /// This can block for an indefinite period of time before returning | ||
| /// and depends on availability of the trust quorum. | ||
| pub async fn load_rack_secret( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want a try_load_rack_secret that only tries once? Also should this log on retry? I'm a little worried this will be a bit of a black box that'll be hard to inspect in case it takes too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention was that the caller would wrap this in their own timeout, rather than passing one in or retrying. However, it's different enough from everything else that this probably isn't warranted. I actually forgot about it at one point when writing tests and caused a hang. I should probably just not retry internally and return the Result<Option<ReconstructedRackSecret>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't look too closely at the tests, but the code itself looks great! just a few minor comments that I'll trust you to resolve :)
| /// Return `Ok(true)` if the configuration has committed, `Ok(false)` if | ||
| /// it hasn't committed yet, or an error otherwise. | ||
| /// | ||
| /// Nexus will retry this operation and so we should only try once here. | ||
| /// This is in contrast to operations like `load_rack_secret` that are | ||
| /// called directly from sled agent. | ||
| pub async fn prepare_and_commit( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you return a two-valued enum here rather than bool?
What is the "otherwise" in "return an error otherwise" here? Just send and receive errors or something else?
Also since this doesn't loop I'd consider calling this  not relevant if we change try_prepare_and_commit.load_rack_secret to not retry.
| /// Nexus will retry this operation and so we should only try once here. | ||
| /// This is in contrast to operations like `load_rack_secret` that are | ||
| /// called directly from sled agent. | ||
| pub async fn commit( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as above.
| Ok(res) | ||
| } | ||
|  | ||
| pub async fn status(&self) -> Result<NodeStatus, NodeApiError> { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a doc comment here?
| /// | ||
| /// This can block for an indefinite period of time before returning | ||
| /// and depends on availability of the trust quorum. | ||
| pub async fn load_rack_secret( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would make sense.
| for envelope in self.ctx.drain_envelopes() { | ||
| self.conn_mgr.send(envelope).await; | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to do this concurrently, or is serially okay? I guess this shouldn't be cancelled since there's an instruction to make run a top-level task.
| } | ||
| } | ||
|  | ||
| // TODO: Process `ctx`: save persistent state | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's ctx here?
| /// Return the status of this node if it is a coordinator | ||
| CoordinatorStatus { responder: oneshot::Sender<Option<CoordinatorStatus>> }, | ||
|  | ||
| /// Load a rack secret for the given epoch | ||
| LoadRackSecret { | ||
| epoch: Epoch, | ||
| responder: oneshot::Sender< | ||
| Result<Option<ReconstructedRackSecret>, LoadRackSecretError>, | ||
| >, | ||
| }, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would consider calling all of the oneshot channels tx or similar
| &poll_interval, | ||
| &poll_max, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, honestly this should take a Duration, not a reference to it. Worth fixing at some point.
This builds on #9258
NodeTasknow uses thetrust_quorum_protocol::Nodeandtrust_quorum_protocol::NodeCtxto send and receive trust quorum messages. An API to drive this was added to theNodeTaskHandle.The majority of code in this PR is tests using the API.
A follow up will deal with saving persistent state to a Ledger.