dulinriley
Contributor

Summary:
Part of #1209

Supervision did not work when an endpoint was called without an "await".
No error message was reported when an endpoint was broadcast to without waiting
for a result. This can happen with "broadcast", and also with implicit messages such as the `__init__`
constructor.

To fix this, create the state polling loop closer to construction time instead of on the first call
to "supervision_event". It cannot be created in the constructor itself because that requires an "Instance", which is
not available there, particularly when unpickling an ActorMeshRef. Each "supervision_event" call clones
a tokio::sync::watch::Receiver, so it sees the same most recent event as every other endpoint.
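The idea can be sketched in plain Python with asyncio. This is only an illustration of the pattern, not the code in this PR: `Watch`, `MeshRef`, `ensure_monitor`, and `health_source` are hypothetical names, and `Watch` is a small stand-in for the "latest value only" behavior of a watch channel. The monitor starts on first use of the mesh (here, a fire-and-forget `broadcast`), so a failure is recorded even if nobody has asked for a supervision event yet.

```python
import asyncio


class Watch:
    """Hypothetical stand-in for a watch channel: keeps only the latest value."""

    def __init__(self):
        self._value = None
        self._changed = asyncio.Event()

    def send(self, value):
        self._value = value
        self._changed.set()

    def latest(self):
        return self._value

    async def changed(self):
        # Wait until a value has been published, then return it.
        await self._changed.wait()
        return self._value


class MeshRef:
    """Hypothetical mesh handle; not the real ActorMeshRef."""

    def __init__(self, health_source):
        self._health_source = health_source
        self._watch = Watch()
        # The monitor is not started here: in this sketch the constructor
        # is assumed to lack the context needed to spawn it.
        self._monitor = None

    def ensure_monitor(self):
        # Start polling on first use, not on the first supervision_event().
        if self._monitor is None:
            loop = asyncio.get_running_loop()
            self._monitor = loop.create_task(self._poll())

    async def _poll(self):
        while True:
            failure = self._health_source()
            if failure is not None:
                self._watch.send(failure)
                return
            await asyncio.sleep(0.01)

    def broadcast(self, message):
        # Fire and forget: the caller never awaits a result.
        self.ensure_monitor()

    def supervision_event(self):
        # Every caller observes the same most recent event.
        self.ensure_monitor()
        return self._watch


async def demo():
    state = {"failed": None}
    mesh = MeshRef(lambda: state["failed"])
    mesh.broadcast("hello")  # no await on any endpoint result
    state["failed"] = "actor crashed"
    first = await asyncio.wait_for(mesh.supervision_event().changed(), 1.0)
    second = mesh.supervision_event().latest()
    return first, second


result = asyncio.run(demo())
```

Both observers end up with the same failure because they read the one shared latest value rather than consuming separate queues.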

This also lets an idle mesh report an error to its owner much earlier.
Also improve the stringification of MeshFailure so that it no longer uses the Debug format, which is too verbose
for ActorSupervisionEvent.

Differential Revision: D85093139

meta-cla bot added the CLA Signed label Oct 21, 2025

meta-codesync bot commented Oct 21, 2025

@dulinriley has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85093139.

dulinriley added a commit to dulinriley/monarch that referenced this pull request Oct 21, 2025
…a-pytorch#1628)

Reviewed By: mariusae

Differential Revision: D85093139

meta-codesync bot commented Oct 21, 2025

This pull request has been merged in 2c811fd.
