-
Notifications
You must be signed in to change notification settings - Fork 45
Planner wait conditions #8453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Planner wait conditions #8453
Conversation
zone_id: OmicronZoneUuid, | ||
reason: ZoneExpungeReason, | ||
}, | ||
ZoneUpdate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be explicit about the timing semantics; this means "a zone with the updated source has appeared in inventory", correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly. Clarified with doc-comments in 386f3b5, but maybe we need better variant names?
planner_step!(MgsUpdatesStep, UpdateZonesStep); | ||
planner_step!(UpdateZonesStep, CockroachDbSettingsStep); | ||
planner_step!(CockroachDbSettingsStep, TerminalStep); | ||
planner_step!(TerminalStep, Never); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We seem to be generating an impl PlannerStep for TerminalStep
that returns a Never
struct, but is Never
actually defined anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, that macro was really broken, thanks for asking! It turned out that second argument was improperly being used as a type parameter name, and so didn't matter at all; Never
was me mis-remembering the name of the (experimental) type called !
. Fixed to actually do what I intended along with (a simple version of) your "ready tokens" idea in efd37e9.
fn do_plan_decommission(&mut self) -> Result<(), Error> { | ||
fn do_plan_decommission( | ||
&mut self, | ||
prev: AddStep, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more of a readability / refactorability nitpick, but I think (hope?) it'll be helpful:
As you've currently structured these steps, you need to provide:
LastStep
as input, and the output isUpdateStepResult<CurrentStep>
If we want to re-order steps, or make some steps concurrent in the future, we'll need to restructure this.
What do you think about re-framing this as:
- Input: a token that acts as "ready for current step"
- Output: same as it currently is
So, for example, this could be:
fn do_plan_decommission(
&mut self,
step: DecommissionStepReady,
) -> Result<UpdateStepResult<DecommissionStep>, Error> { ... }
Then, the macros you've created above can create this DecommisionStepReady
when we've finished the AddStep
, and we could basically only need to define the order in one spot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, implemented in efd37e9.
Co-authored-by: Sean Klein <[email protected]>
let ready = step!(self.do_plan_mgs_updates(ready.into_next())); | ||
let ready = step!(self.do_plan_zone_updates(ready.into_next())?); | ||
step!(self.do_plan_cockroachdb_settings(ready.into_next())?); | ||
unreachable!("should have reached a terminal state already!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the vague comments - complaints without suggestions for fixing them - but you asked for some high level feedback! There are a few things about this that give me pause:
- Could any step choose to return
PlanningComplete
? It looks like right now, the cockroach settings step implicitly knows that it's the last step; if we were to add another step after that, would we have to go back and change that step to returnContinueToNextStep
instead? - (Probably strongly related to the previous point) Do we need an explicit
TerminalStep
? Or, if we do, can we enforce that the terminal step is the only step that means "planning complete"? Ideally in a way that means we don't need thisunreachable!()
. - Is it sensible for a step to return
Waiting(vec![])
(i.e., waiting with an empty vec)? I've wished for a nonempty vec type before. There are a bunch of options here; three are: say "just don't do that", use or create some NonEmptyVec type, or make it something likeWaiting(WaitCondition, Vec<WaitCondition>)
(so at least one condition is required, and you can optionally add others). - My biggest concern: this is adding a system where any planning step can force all subsequent steps to be skipped. I'm not sure the dependencies between steps are that straightforward. It's definitely an incorrect change given the way steps are currently ordered (e.g., just because we have MGS updates to do doesn't mean we shouldn't apply cockroach db settings; on main today, MGS updates existing only skip zone updates, not cockroach settings). Maybe we could address this by reordering the existing steps? But it seems weird that
do_plan_add()
could cause us to skipdo_plan_decommission()
; maybedo_plan_add()
only really wants to causedo_zone_updates()
to be skipped, but decommissioning or MGS updates would be fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all of these concerns are addressable in the framework here except for your biggest one. If we want to encode control-flow information in a uniform return type, I don't see how we can prevent one step from skipping the rest. Sometimes that's what we'll want: in #8456, for instance, if do_plan_mupdate_override()
returns Waiting
, it skips all subsequent steps (except, as you noted, for cockroach settings, which I also guess could be moved earlier). That PR adds explicit control flow in do_plan
; this PR tries to go the other way and make the step functions control the flow and their invocation more uniform. But maybe we don't want that! I'm not married to the approach or the direction.
The code you sketched in this comment on #8298 shows another possible direction, where we accumulate information from previous steps and pass it all to subsequent ones to let them make local, explicit control-flow decisions. That seems fine, too, though I worry somewhat about scaling that approach as we add more steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets at an important question: who decides that because step N is waiting for something that some later step M must not run? Is it: step N, step M, or the code that calls both of them? Obviously it can be any of these, but some choices here seem better than others.
It seems fragile to say that step N gets to decide that subsequent steps cannot run because it involves a lot of non-local reasoning:
- When adding any step, you need to check the code of all previous steps to figure out if they might cause your step not to run, and if so, whether that's correct.
- When modifying any step to either start returning the "stop" value or stop returning it, you need to inspect the code of all subsequent steps to make sure that's still right.
In both cases, this gets worse as you add more steps. We might even run into situations where we want some subsequent steps to run but not others and it's not clear how to communicate that.
It seems simpler to say that step M gets to decide if it runs, given some information about what previous steps did. This way, the reasoning involving each step is in one place. That said, it still means that when working on step M, you need to think about the behavior of all the previous steps. But hopefully that's a little explicit in whatever information we provide about what they did?
But I still feel like the clearest option is to encode this decision in explicit control flow of the function that invokes all the steps. This way, I hope the steps themselves could be largely orthogonal, without thinking much about the other steps. The logic of when to bail out or proceed is all encoded in one place (do_plan()
, basically).
On the specific question of WaitConditions, I wonder about an interface like:
struct PlanningStepReporter { ... }
impl PlanningStepReporter {
/// invoked by a step implementation to say: "I cannot make forward progress until this thing happens"
/// *and* "I'm not finished with what I'm doing". For example: if we're in the middle of an upgrade,
/// and we can't update a CockroachDB zone because the cluster is unhealthy, then we don't want
/// to take any steps, but we _also_ want to communicate that this step (updating zones) isn't finished.
fn report_blocked(&self, event: WaitingEvent) { ... }
/// invoked by a step implementation to say "There's something preventing me from taking actions that
/// I want to take, but I can do other stuff right now.". An example I'm thinking of is when, during SP
/// update, we can't find a TUF artifact for a particular device. We may be able to update a different
/// device, so we're not stuck (yet), but we still want to report the problem.
fn report_nonblocking(&self, event: WaitingEvent) { ... }
/// invoked by a step implementation to report that it did something: e.g., I configured an SP update,
/// or I updated a zone, etc. These don't need to be super detailed (that would duplicate a lot of effort)
/// but it would be useful to be able to say "I did queue an SP update" vs. "I did nothing".
fn report_action(&self, event: ActionEvent)
}
I've written this as something we'd pass to each function but it could as well be something that gets returned. If we gave this to each step (or got it back), then we could reliably distinguish these cases:
- this step took no action and has nothing to do (for a bunch of upgrade steps, this is the condition that means: move on to the next step vs. not)
- this step took some action (for a bunch of upgrade steps, this means don't move onto the next step. but for other steps, like "add" or "expunge", it might be fine to move on)
- this step took no action and is blocked (again, don't move onto the next step)
We could choose to interpret this in do_plan()
or we could pass something to subsequent steps like:
enum Step { MupdateResolution, AddNeeded, ExpungeNeeded, UpgradeSps, UpgradeZones, .. }
struct PreviousStepResults { ... }
impl PreviousStepResults {
/// Returns true if a step is "satisfied", meaning that it took no action and is not blocked.
/// UpgradeZones would check that UpgradeSps is satisfied before continuing.
fn step_satisfied(&self, step: Step) -> bool { ... }
/// Returns true if a step is blocked (allows steps to check if some previous step is waiting).
/// In practice, I think `step_satisfied()` is better.
fn step_blocked(&self, step: Step) -> bool { ... }
/// Returns true if a step took action (allows steps to check if some previous step is done).
/// In practice, I think `step_satisfied()` is better
fn step_took_action(&self, step: Step) -> bool { ... }
}
I'm not attached to any of this -- just kind of feeling out what people think of that sort of thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Apologies if this feels backwards; I felt like the most sensible ordering of responses was starting from the end of your comment and working back to the beginning.)
Inside the hypothetical PreviousStepResults
: I think I understand why you prefer step_satisfied
, but "took no action and is not blocked" may be overly conservative in some cases, I think? For example, the happy path of upgrading an internal DNS zone is:
- Planner expunges the zone because it's out of date; this becomes the target blueprint.
- In subsequent planning attempts,
do_plan_add()
is blocked: it can't start a new internal DNS zone until inventory reflects that the expunged one is actually gone (so it can reuse the special DNS IP). - Inventory starts to report that the expunged zone is gone.
- In the next planning run,
do_plan_add()
will take an action (adding a new internal DNS zone). That meansdo_plan_add()
is not "satisfied" (because it took an action), but it would still be fine for the zone upgrade step to take its own action to upgrade the next zone in line.
Maybe "add a zone and also upgrade a zone in one planning iteration" is fine but unnecessarily optimized? If the zone upgrade step chose to do nothing in the iteration where a zone was added even when it could have, presumably the next planning iteration will see do_plan_add()
is satisfied and therefore move on to the next upgrade. I'm not sure if there are other cases where "not blocked and did nothing" will be overly conservative in a way that we want to avoid.
On PlanningStepReporter
: I like this a lot, at least as an output mechanism. A concern with trying to interpret its contents, if it's a common type shared between all steps, is that the WaitingEvent
and ActionEvent
would presumably have variants for the union of all possible waiting / action events across all steps. Matching on enums like that becomes pretty unwieldy, both because there end up being a lot of variants and because you end up with impossible cases (e.g., "I'm the zone update step, so I'll put unreachable!()
for all the WaitingEvent::ZoneUpdate*
variants if I'm asking what a previous step did").
Having separate WaitingEventAdd
, WaitingEventUpgradeMgs
, ..., enums has different tradeoffs. I'm not sure at this stage which might be better; just noting that we should make a conscious choice about whether to have enums that span all the steps.
But I still feel like the clearest option is to encode this decision in explicit control flow of the function that invokes all the steps. This way, I hope the steps themselves could be largely orthogonal, without thinking much about the other steps. The logic of when to bail out or proceed is all encoded in one place (do_plan(), basically).
I think I'm landing here too. In #8298 (comment) I'd sketched out a very wordy version of "pass previous results into subsequent steps for them to decide what to do". But if we had do_plan()
just in-line decide which steps to call based on the results of all the previous steps, that seems like it sidesteps all the questions of "if I'm writing a step, what earlier or later steps do I need to look at to decide whether I or they can proceed" questions. It might me we need to break the existing steps up more finely? Or add more fine-grained steps in the future? But that seems fine and easy to do if we run into it.
After discussions and some more thought, closing this in favor of other approaches with more explicit flow control and decoupled wait conditions. |
Fixes #8284, hoping to also (eventually) fix #8265, #8298, and the bug surfaced by the test in #8441. Takes the approach mentioned in #8298 and enforces planning steps/phases at compile-time. Also exposes wait conditions via the
blueprint_planner
background task (not really tested or tuned).