Contributor

@mkeeter mkeeter commented Oct 8, 2025

This is staged on top of the merge of #1783 and #1777, so it's hard to review right now; once those PRs are merged, I'll rebase this one.

Fixes #1556 by adding a new transition from "three downstairs in LiveRepairReady" to reconciliation.
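As a rough illustration of that transition (not the code in this PR), the decision can be sketched as a pure function over the three downstairs states. Every name here (`DsState`, `Action`, `choose_action`) is hypothetical, and the rule that live repair needs at least one active peer to repair from is an assumption made for the sketch:

```rust
/// Hypothetical, simplified per-downstairs state. The variants mirror the
/// FLT / LRR / REC / ACT columns printed by `upstairs_info.d`; the real
/// state type in the codebase may look quite different.
#[allow(dead_code)]
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum DsState {
    Faulted,
    LiveRepairReady,
    Reconcile,
    Active,
}

/// Hypothetical action chosen after inspecting all three downstairs.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum Action {
    /// Nothing to do yet.
    Wait,
    /// Assumption: repair the `LiveRepairReady` downstairs from an active peer.
    StartLiveRepair,
    /// All three are `LiveRepairReady`, so there is no active peer to
    /// repair from; reconcile all three instead.
    StartReconciliation,
}

/// Decide what to do for a set of three downstairs (illustrative only).
fn choose_action(ds: [DsState; 3]) -> Action {
    let ready = ds
        .iter()
        .filter(|s| **s == DsState::LiveRepairReady)
        .count();
    let active = ds.iter().filter(|s| **s == DsState::Active).count();

    if ready == ds.len() {
        // The new transition: every downstairs is waiting for repair.
        Action::StartReconciliation
    } else if ready > 0 && active > 0 {
        Action::StartLiveRepair
    } else {
        Action::Wait
    }
}
```

Under these assumptions, `[LiveRepairReady, LiveRepairReady, Faulted]` maps to `Wait`, and three `LiveRepairReady` entries map to `StartReconciliation`, which is the shape of the LRR to REC progression in the trace below.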

I tested this using the same sequence as #1783 to get all three downstairs faulted, then bringing them back online. Watching upstairs_info.d shows that we successfully go through reconciliation (REC) then back to active:

   PID  SESSION DS0 DS1 DS2   UPW   DSW      JOBID   WRITE_BO    IP0   IP1   IP2     D0    D1    D2     S0    S1    S2
 60907 ae1ef96e FLT FLT FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR FLT FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR FLT FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR FLT FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR LRR FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR LRR FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR LRR FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR LRR FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR LRR FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR LRR FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e LRR LRR FLT     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e REC REC REC     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e REC REC REC     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e REC REC REC     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e REC REC REC     0     0       8817          0      0     0     0      0     0     0      0     0     0
 60907 ae1ef96e ACT ACT ACT     1   221       9081          0      1     1     1    220   220   220      0     0     0
 60907 ae1ef96e ACT ACT ACT     1   529       9807          0      1     1     1    528   528   528      0     0     0

@mkeeter mkeeter requested review from jmpesp and leftwo October 8, 2025 18:40
@mkeeter mkeeter force-pushed the mkeeter/live-repair-reconcile branch from 7390c5e to ffd59f0 Compare October 9, 2025 15:59
@mkeeter mkeeter marked this pull request as ready for review October 9, 2025 15:59
Contributor

@leftwo leftwo left a comment

Just a thought on function naming.

I'm excited to see this back, as it unlocks a test I wrote years (and years) ago that I've been waiting to run.

/// If any of the downstairs is not in `LiveRepairReady`
#[must_use]
pub(crate) fn reconcile_from_live_repair_ready(&mut self) -> bool {
    let mut max_flush = 0;
Contributor

I'm going to want a stat for DTrace when we do this, just to see how often it happens. I'm happy to add that myself in a later PR (as I'll be updating the DTrace scripts to print it).

///
/// If all Downstairs are in `LiveRepairReady`, we instead begin
/// reconciliation.
pub(crate) fn check_live_repair_start(&mut self) {
Contributor

I did not really like the name of this function before, and even less so now that it will also do reconciliation from inside here if we deem it necessary.

We are:
Checking to see if we need to do live repair, and possibly checking to see if we need to do reconciliation.

Maybe call this verify_downstairs_consistency()?
I'm open to other suggestions too.
The comment where we call this should be updated to indicate that we are checking for LR or Reconciliation.

Contributor Author

Hmmm, it also starts live-repair, so just `verify` isn't great (neither is `check`, for that matter).

What about ensure_downstairs_consistency?

Contributor

Yesssssssssssss.

Contributor Author

Renamed, and comments are now updated (please let me know if I missed any!)

@mkeeter mkeeter force-pushed the mkeeter/live-repair-reconcile branch 2 times, most recently from b3ae34e to 5094764 Compare October 13, 2025 17:20

Development

Successfully merging this pull request may close these issues.

If three downstairs have all faulted, the upstairs can't self recover

3 participants