Planner wait conditions #8453

plotnick · 2025-06-25T22:09:28Z

Fixes #8284, hoping to also (eventually) fix #8265, #8298, and the bug surfaced by the test in #8441. Takes the approach mentioned in #8298 and enforces planning steps/phases at compile-time. Also exposes wait conditions via the blueprint_planner background task (not really tested or tuned).

smklein · 2025-06-25T22:26:19Z

nexus/types/src/deployment.rs

+        zone_id: OmicronZoneUuid,
+        reason: ZoneExpungeReason,
+    },
+    ZoneUpdate {


We should be explicit about the timing semantics; this means "a zone with the updated source has appeared in inventory", correct?

Yes, exactly. Clarified with doc-comments in 386f3b5, but maybe we need better variant names?

smklein · 2025-06-25T22:34:02Z

nexus/reconfigurator/planning/src/planner.rs

+planner_step!(MgsUpdatesStep, UpdateZonesStep);
+planner_step!(UpdateZonesStep, CockroachDbSettingsStep);
+planner_step!(CockroachDbSettingsStep, TerminalStep);
+planner_step!(TerminalStep, Never);


We seem to be generating an impl PlannerStep for TerminalStep that returns a Never struct, but is Never actually defined anywhere?

Wow, that macro was really broken, thanks for asking! It turned out that second argument was improperly being used as a type parameter name, and so didn't matter at all; Never was me mis-remembering the name of the (experimental) type called !. Fixed to actually do what I intended along with (a simple version of) your "ready tokens" idea in efd37e9.

nexus/reconfigurator/planning/src/planner.rs

smklein · 2025-06-25T22:42:24Z

nexus/reconfigurator/planning/src/planner.rs

-    fn do_plan_decommission(&mut self) -> Result<(), Error> {
+    fn do_plan_decommission(
+        &mut self,
+        prev: AddStep,


This is more of a readability / refactorability nitpick, but I think (hope?) it'll be helpful:

As you've currently structured these steps, you need to provide:

LastStep as input, and the output is UpdateStepResult<CurrentStep>

If we want to re-order steps, or make some steps concurrent in the future, we'll need to restructure this.

What do you think about re-framing this as:

Input: a token that acts as "ready for current step"

Output: same as it currently is

So, for example, this could be:

fn do_plan_decommission( &mut self, step: DecommissionStepReady, ) -> Result<UpdateStepResult<DecommissionStep>, Error> { ... }

Then, the macros you've created above can create this DecommisionStepReady when we've finished the AddStep, and we could basically only need to define the order in one spot.

Thank you, implemented in efd37e9.

Co-authored-by: Sean Klein <[email protected]>

jgallagher · 2025-06-26T14:41:16Z

nexus/reconfigurator/planning/src/planner.rs

+        let ready = step!(self.do_plan_mgs_updates(ready.into_next()));
+        let ready = step!(self.do_plan_zone_updates(ready.into_next())?);
+        step!(self.do_plan_cockroachdb_settings(ready.into_next())?);
+        unreachable!("should have reached a terminal state already!");


Apologies for the vague comments - complaints without suggestions for fixing them - but you asked for some high level feedback! There are a few things about this that give me pause:

Could any step choose to return PlanningComplete? It looks like right now, the cockroach settings step implicitly knows that it's the last step; if we were to add another step after that, would we have to go back and change that step to return ContinueToNextStep instead?

(Probably strongly related to the previous point) Do we need an explicit TerminalStep? Or, if we do, can we enforce that the terminal step is the only step that means "planning complete"? Ideally in a way that means we don't need this unreachable!().

Is it sensible for a step to return Waiting(vec![]) (i.e., waiting with an empty vec)? I've wished for a nonempty vec type before. There are a bunch of options here; three are: say "just don't do that", use or create some NonEmptyVec type, or make it something like Waiting(WaitCondition, Vec<WaitCondition>) (so at least one condition is required, and you can optionally add others).

My biggest concern: this is adding a system where any planning step can force all subsequent steps to be skipped. I'm not sure the dependencies between steps are that straightforward. It's definitely an incorrect change given the way steps are currently ordered (e.g., just because we have MGS updates to do doesn't mean we shouldn't apply cockroach db settings; on main today, MGS updates existing only skip zone updates, not cockroach settings). Maybe we could address this by reordering the existing steps? But it seems weird that do_plan_add() could cause us to skip do_plan_decommission(); maybe do_plan_add() only really wants to cause do_zone_updates() to be skipped, but decommissioning or MGS updates would be fine?

I think all of these concerns are addressable in the framework here except for your biggest one. If we want to encode control-flow information in a uniform return type, I don't see how we can prevent one step from skipping the rest. Sometimes that's what we'll want: in #8456, for instance, if do_plan_mupdate_override() returns Waiting, it skips all subsequent steps (except, as you noted, for cockroach settings, which I also guess could be moved earlier). That PR adds explicit control flow in do_plan; this PR tries to go the other way and make the step functions control the flow and their invocation more uniform. But maybe we don't want that! I'm not married to the approach or the direction.

The code you sketched in this comment on #8298 shows another possible direction, where we accumulate information from previous steps and pass it all to subsequent ones to let them make local, explicit control-flow decisions. That seems fine, too, though I worry somewhat about scaling that approach as we add more steps.

This gets at an important question: who decides that because step N is waiting for something that some later step M must not run? Is it: step N, step M, or the code that calls both of them? Obviously it can be any of these, but some choices here seem better than others.

It seems fragile to say that step N gets to decide that subsequent steps cannot run because it involves a lot of non-local reasoning:

When adding any step, you need to check the code of all previous steps to figure out if they might cause your step not to run, and if so, whether that's correct.

When modifying any step to either start returning the "stop" value or stop returning it, you need to inspect the code of all subsequent steps to make sure that's still right.

In both cases, this gets worse as you add more steps. We might even run into situations where we want some subsequent steps to run but not others and it's not clear how to communicate that.

It seems simpler to say that step M gets to decide if it runs, given some information about what previous steps did. This way, the reasoning involving each step is in one place. That said, it still means that when working on step M, you need to think about the behavior of all the previous steps. But hopefully that's a little explicit in whatever information we provide about what they did?

But I still feel like the clearest option is to encode this decision in explicit control flow of the function that invokes all the steps. This way, I hope the steps themselves could be largely orthogonal, without thinking much about the other steps. The logic of when to bail out or proceed is all encoded in one place (do_plan(), basically).

On the specific question of WaitConditions, I wonder about an interface like:

struct PlanningStepReporter { ... } impl PlanningStepReporter { /// invoked by a step implementation to say: "I cannot make forward progress until this thing happens" /// *and* "I'm not finished with what I'm doing". For example: if we're in the middle of an upgrade, /// and we can't update a CockroachDB zone because the cluster is unhealthy, then we don't want /// to take any steps, but we _also_ want to communicate that this step (updating zones) isn't finished. fn report_blocked(&self, event: WaitingEvent) { ... } /// invoked by a step implementation to say "There's something preventing me from taking actions that /// I want to take, but I can do other stuff right now.". An example I'm thinking of is when, during SP /// update, we can't find a TUF artifact for a particular device. We may be able to update a different /// device, so we're not stuck (yet), but we still want to report the problem. fn report_nonblocking(&self, event: WaitingEvent) { ... } /// invoked by a step implementation to report that it did something: e.g., I configured an SP update, /// or I updated a zone, etc. These don't need to be super detailed (that would duplicate a lot of effort) /// but it would be useful to be able to say "I did queue an SP update" vs. "I did nothing". fn report_action(&self, event: ActionEvent) }

I've written this as something we'd pass to each function but it could as well be something that gets returned. If we gave this to each step (or got it back), then we could reliably distinguish these cases:

this step took no action and has nothing to do (for a bunch of upgrade steps, this is the condition that means: move on to the next step vs. not)

this step took some action (for a bunch of upgrade steps, this means don't move onto the next step. but for other steps, like "add" or "expunge", it might be fine to move on)

this step took no action and is blocked (again, don't move onto the next step)

We could choose to interpret this in do_plan() or we could pass something to subsequent steps like:

enum Step { MupdateResolution, AddNeeded, ExpungeNeeded, UpgradeSps, UpgradeZones, .. } struct PreviousStepResults { ... } impl PreviousStepResults { /// Returns true if a step is "satisfied", meaning that it took no action and is not blocked. /// UpgradeZones would check that UpgradeSps is satisfied before continuing. fn step_satisfied(&self, step: Step) -> bool { ... } /// Returns true if a step is blocked (allows steps to check if some previous step is waiting). /// In practice, I think `step_satisfied()` is better. fn step_blocked(&self, step: Step) -> bool { ... } /// Returns true if a step took action (allows steps to check if some previous step is done). /// In practice, I think `step_satisfied()` is better fn step_took_action(&self, step: Step) -> bool { ... } }

I'm not attached to any of this -- just kind of feeling out what people think of that sort of thing.

(Apologies if this feels backwards; I felt like the most sensible ordering of responses was starting from the end of your comment and working back to the beginning.)

Inside the hypothetical PreviousStepResults: I think I understand why you prefer step_satisfied, but "took no action and is not blocked" may be overly conservative in some cases, I think? For example, the happy path of upgrading an internal DNS zone is:

Planner expunges the zone because it's out of date; this becomes the target blueprint.

In subsequent planning attempts, do_plan_add() is blocked: it can't start a new internal DNS zone until inventory reflects that the expunged one is actually gone (so it can reuse the special DNS IP).

Inventory starts to report that the expunged zone is gone.

In the next planning run, do_plan_add() will take an action (adding a new internal DNS zone). That means do_plan_add() is not "satisfied" (because it took an action), but it would still be fine for the zone upgrade step to take its own action to upgrade the next zone in line.

Maybe "add a zone and also upgrade a zone in one planning iteration" is fine but unnecessarily optimized? If the zone upgrade step chose to do nothing in the iteration where a zone was added even when it could have, presumably the next planning iteration will see do_plan_add() is satisfied and therefore move on to the next upgrade. I'm not sure if there are other cases where "not blocked and did nothing" will be overly conservative in a way that we want to avoid.

On PlanningStepReporter: I like this a lot, at least as an output mechanism. A concern with trying to interpret its contents, if it's a common type shared between all steps, is that the WaitingEvent and ActionEvent would presumably have variants for the union of all possible waiting / action events across all steps. Matching on enums like that becomes pretty unwieldy, both because there end up being a lot of variants and because you end up with impossible cases (e.g., "I'm the zone update step, so I'll put unreachable!() for all the WaitingEvent::ZoneUpdate* variants if I'm asking what a previous step did").

Having separate WaitingEventAdd, WaitingEventUpgradeMgs, ..., enums has different tradeoffs. I'm not sure at this stage which might be better; just noting that we should make a conscious choice about whether to have enums that span all the steps.

But I still feel like the clearest option is to encode this decision in explicit control flow of the function that invokes all the steps. This way, I hope the steps themselves could be largely orthogonal, without thinking much about the other steps. The logic of when to bail out or proceed is all encoded in one place (do_plan(), basically).

I think I'm landing here too. In #8298 (comment) I'd sketched out a very wordy version of "pass previous results into subsequent steps for them to decide what to do". But if we had do_plan() just in-line decide which steps to call based on the results of all the previous steps, that seems like it sidesteps all the questions of "if I'm writing a step, what earlier or later steps do I need to look at to decide whether I or they can proceed" questions. It might me we need to break the existing steps up more finely? Or add more fine-grained steps in the future? But that seems fine and easy to do if we run into it.

plotnick · 2025-06-27T14:11:09Z

After discussions and some more thought, closing this in favor of other approaches with more explicit flow control and decoupled wait conditions.

plotnick added 2 commits June 25, 2025 15:42

Add planner wait conditions

3aa4aba

Report wait conditions in blueprint planner status

999169f

plotnick changed the base branch from main to sim-disks-from-blueprint June 25, 2025 22:14

smklein reviewed Jun 25, 2025

View reviewed changes

plotnick added 2 commits June 25, 2025 17:53

Enforce ordering of planner steps in one place

efd37e9

Add doc-comments to WaitCondition variants

386f3b5

plotnick force-pushed the planner-wait branch from b566bb5 to 386f3b5 Compare June 25, 2025 23:58

Better unreachable error

87f50ca

Co-authored-by: Sean Klein <[email protected]>

jgallagher reviewed Jun 26, 2025

View reviewed changes

plotnick closed this Jun 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Planner wait conditions #8453

Planner wait conditions #8453

Uh oh!

plotnick commented Jun 25, 2025

Uh oh!

smklein Jun 25, 2025

Uh oh!

plotnick Jun 25, 2025

Uh oh!

smklein Jun 25, 2025

Uh oh!

plotnick Jun 25, 2025

Uh oh!

Uh oh!

smklein Jun 25, 2025

Uh oh!

plotnick Jun 26, 2025

Uh oh!

jgallagher Jun 26, 2025

Uh oh!

plotnick Jun 26, 2025

Uh oh!

davepacheco Jun 26, 2025

Uh oh!

jgallagher Jun 26, 2025

Uh oh!

plotnick commented Jun 27, 2025

Uh oh!

Uh oh!

Planner wait conditions #8453

Planner wait conditions #8453

Uh oh!

Conversation

plotnick commented Jun 25, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

plotnick commented Jun 27, 2025

Uh oh!

Uh oh!