-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run live migration directly on the state driver task #720
Run live migration directly on the state driver task #720
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is just expansive enough that it's hard to tell what code moved and what code is new, so here are a few comments pointing out what's what (and one noting a typo that I need to go fix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VmEnsure
types in this file are net new code. The "initialize VM components" free function in here moved in from startup.rs
unchanged.
I arranged the initialization phases this way in part to leave room to allow future versions of the LM protocol to have more of a say in exactly what instance spec is used to initialize VM objects. Right now it's just the spec from the ensure request, but I have my eyes on using (at least part of) the instance spec received from the source in the sync phase instead.
/// Negotiates a live migration protocol version with a target who has connected | ||
/// over `conn`. If this is successful, returns a `SourceProtocol` | ||
/// implementation that can be used to run the requested migration. | ||
pub(crate) async fn initiate<T: MigrateConn>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the logic in this function was taken from the (removed) source_start
function in migrate/mod.rs
.
|
||
/// A helper type for abstracting away type-level differences between the | ||
/// different phases of instance ensure. | ||
enum EnsureState<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wrapper enum and its usage below are the other major source of net-new code in this PR.
/// in `migrate_info`, then negotiates a protocol version with that source. | ||
/// Returns a [`DestinationProtocol`] implementation for the negotiated version | ||
/// that the caller can use to run the migration. | ||
pub(crate) async fn initiate<'e>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is more or less identical to the old dest_initiate
function in the root migration module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, I like the changes to remove the EnsureState
enum! I left some more style suggestions --- let me know what you think!
match phase { | ||
MigratePhase::RamPushPrePause => { | ||
self.update_state(MigrationState::RamPush).await | ||
} | ||
MigratePhase::RamPushPostPause => { | ||
self.update_state(MigrationState::RamPushDirty).await | ||
} | ||
MigratePhase::RamPushPrePause => self.update_state( | ||
ensure_ctx.state_publisher(), | ||
MigrationState::RamPush, | ||
), | ||
MigratePhase::RamPushPostPause => self.update_state( | ||
ensure_ctx.state_publisher(), | ||
MigrationState::RamPushDirty, | ||
), | ||
_ => unreachable!("should only push RAM in a RAM push phase"), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpicky, considered unimportant: wouldn't it be a bit less repetitive to write:
let state = match phase {
MigratePhase::RamPushPrePause => MigrationState::RamPush,
MigratePhase::RamPushPostPause => MigrationState::RamPushPostPause,
_ => unreachable!("should only push RAM in a RAM push phase"),
};
self.update_state(ensure_ctx.state_publisher(), state);
Looks great to me! |
Run live migration protocols on the state driver's tokio task without using `block_on`: - Define `SourceProtocol` and `DestinationProtocol` traits that describe the routines the state driver uses to run a generic migration irrespective of its protocol version. (This will be useful for protocol versioning later.) - Move the `migrate::source_start` and `migrate::dest_initiate` routines into factory functions that connect to the peer Propolis, negotiate protocol versions, and return an appropriate protocol impl. - Use the protocol impls to run migration on the state driver task. Remove all the types and constructs used to pass messages between it and migration tasks. Also, improve the interface between the `vm` and `migrate` modules for inbound migrations by defining some objects that migrations can use either to fully initialize a VM or to unwind correctly if migration fails. This allows migration to take control of when precisely a VM's components get created (and from what spec) without exposing to the migration task all the complexity of unwinding from a failed attempt to create a VM. Tested via full PHD run with a Debian 11 guest.
Stacked on #715.
#715 changes propolis-server's VM state driver from an OS thread to a tokio task. This allows live migration protocols to run directly on the state driver without the use of
block_on
, so do just that. Specifically:SourceProtocol
andDestinationProtocol
traits that describe the routines the state driver uses to run a generic migration irrespective of its protocol version.migrate::source_start
andmigrate::dest_initiate
routines into factory functions that connect to the peer Propolis, negotiate protocol versions, and return an appropriate protocol impl.Also, try to improve the interface between the
vm
module and themigrate
module for inbound migrations. Inbound live migration is tricky to manage because (with #715) the various steps needed to create Propolis VM objects are interleaved with the steps in the live migration protocol; if migration fails, the precise set of steps needed to drive the VM state machine to its final state depends on exactly how many VM initialization steps the migration protocol has taken so far. To try to hide this complexity, define someVmEnsure
objects that migrations can use either to fully initialize a VM or to unwind correctly if a failure occurs.Tested with a full PHD run using a Debian 11 guest.