Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run live migration directly on the state driver task #720

Merged

Conversation

gjcolombo
Copy link
Contributor

Stacked on #715.

#715 changes propolis-server's VM state driver from an OS thread to a tokio task. This allows live migration protocols to run directly on the state driver without the use of block_on, so do just that. Specifically:

  • Define SourceProtocol and DestinationProtocol traits that describe the routines the state driver uses to run a generic migration irrespective of its protocol version.
  • Move the migrate::source_start and migrate::dest_initiate routines into factory functions that connect to the peer Propolis, negotiate protocol versions, and return an appropriate protocol impl.
  • Use the protocol impls to run migration on the state driver task. Remove all the types and constructs used to pass messages between this task and migration tasks.

Also, try to improve the interface between the vm module and the migrate module for inbound migrations. Inbound live migration is tricky to manage because (with #715) the various steps needed to create Propolis VM objects are interleaved with the steps in the live migration protocol; if migration fails, the precise set of steps needed to drive the VM state machine to its final state depends on exactly how many VM initialization steps the migration protocol has taken so far. To try to hide this complexity, define some VmEnsure objects that migrations can use either to fully initialize a VM or to unwind correctly if a failure occurs.

Tested with a full PHD run using a Debian 11 guest.

Copy link
Contributor Author

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is just expansive enough that it's hard to tell what code moved and what code is new, so here are a few comments pointing out what's what (and one noting a typo that I need to go fix).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VmEnsure types in this file are net new code. The "initialize VM components" free function in here moved in from startup.rs unchanged.

I arranged the initialization phases this way in part to leave room to allow future versions of the LM protocol to have more of a say in exactly what instance spec is used to initialize VM objects. Right now it's just the spec from the ensure request, but I have my eyes on using (at least part of) the instance spec received from the source in the sync phase instead.

/// Negotiates a live migration protocol version with a target who has connected
/// over `conn`. If this is successful, returns a `SourceProtocol`
/// implementation that can be used to run the requested migration.
pub(crate) async fn initiate<T: MigrateConn>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the logic in this function was taken from the (removed) source_start function in migrate/mod.rs.

phd-tests/tests/src/migrate.rs Outdated Show resolved Hide resolved

/// A helper type for abstracting away type-level differences between the
/// different phases of instance ensure.
enum EnsureState<'a> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wrapper enum and its usage below are the other major source of net-new code in this PR.

/// in `migrate_info`, then negotiates a protocol version with that source.
/// Returns a [`DestinationProtocol`] implementation for the negotiated version
/// that the caller can use to run the migration.
pub(crate) async fn initiate<'e>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is more or less identical to the old dest_initiate function in the root migration module.

@gjcolombo gjcolombo marked this pull request as ready for review July 3, 2024 23:51
@gjcolombo gjcolombo requested a review from hawkw July 3, 2024 23:51
bin/propolis-server/src/lib/migrate/destination.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/migrate/destination.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/migrate/destination.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/migrate/destination.rs Outdated Show resolved Hide resolved
@gjcolombo gjcolombo requested a review from hawkw July 10, 2024 02:32
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I like the changes to remove the EnsureState enum! I left some more style suggestions --- let me know what you think!

Comment on lines 354 to 364
match phase {
MigratePhase::RamPushPrePause => {
self.update_state(MigrationState::RamPush).await
}
MigratePhase::RamPushPostPause => {
self.update_state(MigrationState::RamPushDirty).await
}
MigratePhase::RamPushPrePause => self.update_state(
ensure_ctx.state_publisher(),
MigrationState::RamPush,
),
MigratePhase::RamPushPostPause => self.update_state(
ensure_ctx.state_publisher(),
MigrationState::RamPushDirty,
),
_ => unreachable!("should only push RAM in a RAM push phase"),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicky, considered unimportant: wouldn't it be a bit less repetitive to write:

        let state = match phase {
            MigratePhase::RamPushPrePause => MigrationState::RamPush,
            MigratePhase::RamPushPostPause => MigrationState::RamPushPostPause,
            _ => unreachable!("should only push RAM in a RAM push phase"),
        };
        self.update_state(ensure_ctx.state_publisher(), state);

bin/propolis-server/src/lib/vm/ensure.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/vm/ensure.rs Show resolved Hide resolved
bin/propolis-server/src/lib/vm/ensure.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/vm/state_driver.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/vm/state_driver.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/migrate/source.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/migrate/source.rs Outdated Show resolved Hide resolved
bin/propolis-server/src/lib/migrate/destination.rs Outdated Show resolved Hide resolved
@gjcolombo
Copy link
Contributor Author

Thanks as always, @hawkw! Except where I commented otherwise the style feedback is incorporated into 6feb3c8.

@hawkw
Copy link
Member

hawkw commented Jul 11, 2024

Looks great to me!

@gjcolombo gjcolombo merged commit 994cdf3 into gjcolombo/server-refactor-2024 Jul 12, 2024
5 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/migrate-on-state-driver branch July 12, 2024 00:51
gjcolombo added a commit that referenced this pull request Jul 16, 2024
Run live migration protocols on the state driver's tokio task without using
`block_on`:

- Define `SourceProtocol` and `DestinationProtocol` traits that describe
  the routines the state driver uses to run a generic migration
  irrespective of its protocol version. (This will be useful for protocol
  versioning later.)
- Move the `migrate::source_start` and `migrate::dest_initiate` routines
  into factory functions that connect to the peer Propolis, negotiate
  protocol versions, and return an appropriate protocol impl.
- Use the protocol impls to run migration on the state driver task. Remove
  all the types and constructs used to pass messages between it and
  migration tasks.

Also, improve the interface between the `vm` and `migrate` modules for
inbound migrations by defining some objects that migrations can use either
to fully initialize a VM or to unwind correctly if migration fails. This
allows migration to take control of when precisely a VM's components get
created (and from what spec) without exposing to the migration task all the
complexity of unwinding from a failed attempt to create a VM.

Tested via full PHD run with a Debian 11 guest.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants