Skip to content

Conversation

@ishant162
Copy link
Owner

@ishant162 ishant162 commented May 8, 2025

Summary

Background: Enhance Aggregator and Collaborators to gracefully catch exceptions, allowing the respective Director and Envoys to return to the wait_for_experiment state.

Type of Change (Mandatory)

Specify the type of change being made.

  • Feature enhancement (Error resilience)

Description (Mandatory)

This PR introduces graceful exception handling and improved error reporting across the federation experiment workflow. Key changes include:

  • Refactoring aggregator, experiment and director components to improve flow state tracking and status updates.
  • Updating GetFlowState responses to include exceptions.
  • Propagating exception details back to the user.

Testing

  • Verified exception handling during the Aggregator step.
  • Verified exception handling during the Collaborator step.
  • Handled conflict error when both include and exclude are set in self.next.
  • Handled deserialization exceptions caused by unserializable attributes in the flow.
  • Handled failure occurring when the Director attempts to send experiment data to Envoys.

Additional Information

Files modified:

  • openfl/experimental/workflow/component/aggregator/aggregator.py
  • openfl/experimental/workflow/component/director/director.py
  • openfl/experimental/workflow/component/director/experiment.py
  • openfl/experimental/workflow/component/envoy/envoy.py
  • openfl/experimental/workflow/interface/fl_spec.py
  • openfl/experimental/workflow/protocols/director.proto
  • openfl/experimental/workflow/runtime/federated_runtime.py
  • openfl/experimental/workflow/transport/grpc/director_client.py
  • openfl/experimental/workflow/transport/grpc/director_server.py

@ishant162 ishant162 closed this May 9, 2025
@ishant162 ishant162 reopened this May 9, 2025
Copy link
Collaborator

@scngupta-dsp scngupta-dsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us discuss when you have some time

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>
@ishant162 ishant162 requested a review from Copilot May 9, 2025 08:07
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves error handling and status reporting throughout the workflow by introducing graceful exception handling, extended flow state information (including exceptions), and updated logging. Key changes include:

  • Adding try/except blocks in director_server to report errors gracefully.
  • Updating GetFlowState methods and related proto messages to include an exception field.
  • Refactoring Experiment and Aggregator components to utilize a new ExperimentStatus structure for detailed status and error trace propagation.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
openfl/experimental/workflow/transport/grpc/director_server.py Adds exception handling in GetExperimentData using try/except and reports errors via context.abort.
openfl/experimental/workflow/transport/grpc/director_client.py Modifies get_flow_state to include exception details in its return value.
openfl/experimental/workflow/runtime/federated_runtime.py Updates get_flow_state to handle the new triplet (status, flow object, exception).
openfl/experimental/workflow/protocols/director.proto Introduces a new exception field to GetFlowStateResponse.
openfl/experimental/workflow/interface/fl_spec.py Enhances error messaging in _run_federated by building a contextual error message.
openfl/experimental/workflow/component/envoy/envoy.py Improves logging and error handling during experiment data retrieval and collaborator execution.
openfl/experimental/workflow/component/director/experiment.py Refactors experiment status handling using the new ExperimentStatus dataclass and returns a detailed status dict.
openfl/experimental/workflow/component/director/director.py Adjusts experiment waiting and flow state retrieval to align with updated API contracts.
openfl/experimental/workflow/component/aggregator/aggregator.py Refactors run_flow with new helper methods for better flow initialization and collaborator task management.

ishant162 added 2 commits May 9, 2025 15:28
Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>
Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>
@ishant162 ishant162 closed this May 12, 2025
@ishant162 ishant162 reopened this May 12, 2025
@ishant162 ishant162 requested a review from Copilot May 12, 2025 11:00
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces graceful exception handling and improved error reporting across the federation experiment workflow. Key changes include:

  • Wrapping experiment data streaming in director_server.py and updating GetFlowState responses to include exceptions.
  • Propagating exception details through FederatedRuntime, Experiment, and director client components.
  • Refactoring aggregator and director components to improve flow state tracking and status updates.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
openfl/experimental/workflow/transport/grpc/director_server.py Adds try/except block for GetExperimentData and updates GetFlowState to include exception details.
openfl/experimental/workflow/transport/grpc/director_client.py Updates get_flow_state return signature to include an exception value.
openfl/experimental/workflow/runtime/federated_runtime.py Adjusts get_flow_state method to process three return values.
openfl/experimental/workflow/interface/fl_spec.py Improves exception handling in the federated runtime flow setup.
openfl/experimental/workflow/component/envoy/envoy.py Reorders data stream handling to better separate exception handling logic.
openfl/experimental/workflow/component/director/experiment.py Refactors experiment status tracking and error reporting using a new ExperimentStatus dataclass.
openfl/experimental/workflow/component/director/director.py Modifies get_flow_state and experiment waiting logic with clearer state checks.
openfl/experimental/workflow/component/aggregator/aggregator.py Replaces the old run_flow implementation and introduces helper methods for state initialization and collaborator queue handling.
Comments suppressed due to low confidence (2)

openfl/experimental/workflow/component/director/director.py:226

  • Avoid using magic numbers (like '4') when checking the experiment status. Use the corresponding enum constant (e.g., Status.FAILED) for improved clarity and maintainability.
if experiment.experiment_status.status.value == 4:

openfl/experimental/workflow/component/director/experiment.py:204

  • [nitpick] Consider providing a more descriptive error message that includes contextual details about the failure point, as the current message might be ambiguous when exp_name is None.
raise Exception(f"{error_msg} due to error: {e}") from e

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>
Copy link
Collaborator

@scngupta-dsp scngupta-dsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. Overall looks good Ishant. Few suggestions:

  • Can we have a relook into the Director / Aggregator / Experiment code and Envoy code to ensure that our changes do not look like patchwork
  • We also need to run a mental simulation to ensure that Phase 2 can be introduced seamlessly
    Let us discuss

def _restore_instance_snapshot(self) -> None:
"""Restore instance snapshot if it exists."""
if hasattr(self, "instance_snapshot"):
self.flow.restore_instance_snapshot(self.flow, list(self.instance_snapshot))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls check if we need the check. FLSpec.restore_instance_snapshot already checks whether there is a backup

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can your loop be optimized for readability and efficiency by avoiding repeated dictionary lookups? And instead of k and V can we give them more generic name for readability ?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right about renaming k and v for better readability — I’ll update them to more meaningful names like collaborator and task_queue.

Regarding the optimization: the loop is already efficient as it avoids unnecessary operations by checking membership in selected_collaborators before putting the task into the queue. The dictionary lookup (self.__collaborator_tasks_queue.items()) is performed only once at the beginning, and each key is accessed only once per iteration. So, performance-wise, the loop is already optimal.

Comment on lines +229 to +230
self.experiment_status.update_experiment_status(Status.IN_PROGRESS)
logger.info(f"New experiment {self.name} for collaborators {self.collaborators}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make ExperimentStatus a sub-class of Experiment or something more object oriented / intuitive ? Please evaluate

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will evaluate

)

def all_quit_jobs_sent(self) -> bool:
"""Assert all quit jobs are sent to collaborators."""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current docstring is grammatically understandable but could be made clearer and more descriptive. Example """Check whether quit jobs have been sent to all authorized collaborators."""
Use "Check whether..." instead of "Assert..." to reflect that it's a boolean check, not an assert statement.
Clarifies that it's about authorized collaborators, which adds context.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorporated.

self._aggregator_grpc_server = None
self.aggregator = None
self.updated_flow = None
self.experiment_exception_trace = None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

experiment_exception_trace can be removed if we are not using it anywhere

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

f"\033[91m{exception}\033[0m",
)
return flspec_obj

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use some constants or something more readable for color coding

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense. Similarly, since color constants are used throughout OpenFL, we can create a dedicated utility module for them. This would promote consistency and reusability across the codebase. I suggest we consider implementing this in a future iteration.

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>
@ishant162 ishant162 closed this Sep 8, 2025
@ishant162 ishant162 reopened this Sep 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants