Step 1: Graceful Exception Handling and Recovery #18

ishant162 · 2025-05-08T12:56:03Z

Summary

Background: Enhance Aggregator and Collaborators to gracefully catch exceptions, allowing the respective Director and Envoys to return to the wait_for_experiment state.

Type of Change (Mandatory)

Specify the type of change being made.

Feature enhancement (Error resilience)

Description (Mandatory)

This PR introduces graceful exception handling and improved error reporting across the federation experiment workflow. Key changes include:

Refactoring aggregator, experiment and director components to improve flow state tracking and status updates.
Updating GetFlowState responses to include exceptions.
Propagating exception details back to the user.
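
For readers new to the PR, the recovery idea can be sketched as follows. This is a minimal illustration, not the code in this PR: `EnvoyState`, `serve_once` and `run_experiment` are hypothetical names, and only the `wait_for_experiment` state name comes from the summary above.

```python
import logging
import traceback
from enum import Enum
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)


class EnvoyState(Enum):
    """Hypothetical states; only wait_for_experiment is named in the PR."""

    WAIT_FOR_EXPERIMENT = "wait_for_experiment"
    RUNNING = "running"


def serve_once(run_experiment: Callable[[], None]) -> Tuple[EnvoyState, Optional[str]]:
    """Run one experiment and always come back to the waiting state.

    Returns the state to continue in and the formatted traceback (if any),
    so the caller can report the failure to the user.
    """
    try:
        run_experiment()
        return EnvoyState.WAIT_FOR_EXPERIMENT, None
    except Exception:
        # Catch, record and report instead of letting the process die.
        trace = traceback.format_exc()
        logger.error("Experiment failed:\n%s", trace)
        return EnvoyState.WAIT_FOR_EXPERIMENT, trace
```

The point of returning the trace rather than re-raising is that the component stays alive and the error text can still travel back to whoever submitted the experiment.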

Testing

Verified exception handling during the Aggregator step.
Verified exception handling during the Collaborator step.
Handled conflict error when both include and exclude are set in self.next.
Handled deserialization exceptions caused by unserializable attributes in the flow.
Handled failure occurring when the Director attempts to send experiment data to Envoys.

Additional Information

Files modified:

openfl/experimental/workflow/component/aggregator/aggregator.py
openfl/experimental/workflow/component/director/director.py
openfl/experimental/workflow/component/director/experiment.py
openfl/experimental/workflow/component/envoy/envoy.py
openfl/experimental/workflow/interface/fl_spec.py
openfl/experimental/workflow/protocols/director.proto
openfl/experimental/workflow/runtime/federated_runtime.py
openfl/experimental/workflow/transport/grpc/director_client.py
openfl/experimental/workflow/transport/grpc/director_server.py

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

scngupta-dsp

Let us discuss when you have some time

openfl/experimental/workflow/component/director/director.py

openfl/experimental/workflow/component/director/experiment.py

openfl/experimental/workflow/interface/fl_spec.py

openfl/experimental/workflow/protocols/director.proto

openfl/experimental/workflow/transport/grpc/director_client.py

openfl/experimental/workflow/transport/grpc/director_server.py

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

Copilot

Pull Request Overview

This PR improves error handling and status reporting throughout the workflow by introducing graceful exception handling, extended flow state information (including exceptions), and updated logging. Key changes include:

Adding try/except blocks in director_server to report errors gracefully.
Updating GetFlowState methods and related proto messages to include an exception field.
Refactoring Experiment and Aggregator components to utilize a new ExperimentStatus structure for detailed status and error trace propagation.
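
To make the status structure concrete, here is a rough sketch of what an `ExperimentStatus` dataclass might look like. Only the names `ExperimentStatus`, `Status` and `update_experiment_status` appear in this PR; the enum members and values, the `exception_trace` field and the method signature are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # Assumed members and values, for illustration only.
    PENDING = 1
    IN_PROGRESS = 2
    FINISHED = 3
    FAILED = 4


@dataclass
class ExperimentStatus:
    status: Status = Status.PENDING
    exception_trace: Optional[str] = None  # hypothetical field name

    def update_experiment_status(
        self, status: Status, exception_trace: Optional[str] = None
    ) -> None:
        """Record the new status and, when given, the error trace."""
        self.status = status
        if exception_trace is not None:
            self.exception_trace = exception_trace
```

Keeping the status and the trace in one object means a single value can be passed from the Experiment to whatever builds the flow state response.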

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

File	Description
openfl/experimental/workflow/transport/grpc/director_server.py	Adds exception handling in GetExperimentData using try/except and reports errors via context.abort.
openfl/experimental/workflow/transport/grpc/director_client.py	Modifies get_flow_state to include exception details in its return value.
openfl/experimental/workflow/runtime/federated_runtime.py	Updates get_flow_state to handle the new triplet (status, flow object, exception).
openfl/experimental/workflow/protocols/director.proto	Introduces a new exception field to GetFlowStateResponse.
openfl/experimental/workflow/interface/fl_spec.py	Enhances error messaging in _run_federated by building a contextual error message.
openfl/experimental/workflow/component/envoy/envoy.py	Improves logging and error handling during experiment data retrieval and collaborator execution.
openfl/experimental/workflow/component/director/experiment.py	Refactors experiment status handling using the new ExperimentStatus dataclass and returns a detailed status dict.
openfl/experimental/workflow/component/director/director.py	Adjusts experiment waiting and flow state retrieval to align with updated API contracts.
openfl/experimental/workflow/component/aggregator/aggregator.py	Refactors run_flow with new helper methods for better flow initialization and collaborator task management.
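
The triplet mentioned for `federated_runtime.py` could be consumed roughly as below. This is a sketch under assumptions: the status is treated as a boolean "finished" flag and the exception as a string, neither of which is confirmed by the summary, and `wait_for_flow` is a hypothetical helper.

```python
from typing import Any, Callable, Optional, Tuple

# Assumed shape: (finished flag, flow object, exception text).
FlowState = Tuple[bool, Optional[Any], Optional[str]]


def wait_for_flow(get_flow_state: Callable[[], FlowState]) -> Any:
    """Unpack the three values and surface a remote failure to the caller."""
    finished, flow_object, exception = get_flow_state()
    if exception:
        # Hypothetical handling: raise locally with the remote error text.
        raise RuntimeError(f"Experiment failed on the director:\n{exception}")
    return flow_object if finished else None
```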

openfl/experimental/workflow/transport/grpc/director_server.py

openfl/experimental/workflow/interface/fl_spec.py

openfl/experimental/workflow/component/director/experiment.py

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

Copilot

Pull Request Overview

This PR introduces graceful exception handling and improved error reporting across the federation experiment workflow. Key changes include:

Wrapping experiment data streaming in director_server.py and updating GetFlowState responses to include exceptions.
Propagating exception details through FederatedRuntime, Experiment, and director client components.
Refactoring aggregator and director components to improve flow state tracking and status updates.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

File	Description
openfl/experimental/workflow/transport/grpc/director_server.py	Adds try/except block for GetExperimentData and updates GetFlowState to include exception details.
openfl/experimental/workflow/transport/grpc/director_client.py	Updates get_flow_state return signature to include an exception value.
openfl/experimental/workflow/runtime/federated_runtime.py	Adjusts get_flow_state method to process three return values.
openfl/experimental/workflow/interface/fl_spec.py	Improves exception handling in the federated runtime flow setup.
openfl/experimental/workflow/component/envoy/envoy.py	Reorders data stream handling to better separate exception handling logic.
openfl/experimental/workflow/component/director/experiment.py	Refactors experiment status tracking and error reporting using a new ExperimentStatus dataclass.
openfl/experimental/workflow/component/director/director.py	Modifies get_flow_state and experiment waiting logic with clearer state checks.
openfl/experimental/workflow/component/aggregator/aggregator.py	Replaces the old run_flow implementation and introduces helper methods for state initialization and collaborator queue handling.

Comments suppressed due to low confidence (2)

openfl/experimental/workflow/component/director/director.py:226

Avoid using magic numbers (like '4') when checking the experiment status. Use the corresponding enum constant (e.g., Status.FAILED) for improved clarity and maintainability.

if experiment.experiment_status.status.value == 4:
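
A small sketch of the suggested comparison. The `Status` enum here is a stand-in with an assumed `FAILED = 4` member (the comment above only offers `Status.FAILED` as an example), and `has_failed` is a hypothetical helper.

```python
from enum import Enum


class Status(Enum):
    # Assumed member and value, taken from the example in the comment above.
    FAILED = 4


def has_failed(status: Status) -> bool:
    # Compare enum members by identity instead of their raw values.
    return status is Status.FAILED
```

With this style the check still works if the numeric values are ever reordered, because nothing depends on the literal `4`.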

openfl/experimental/workflow/component/director/experiment.py:204

[nitpick] Consider providing a more descriptive error message that includes contextual details about the failure point, as the current message might be ambiguous when exp_name is None.

raise Exception(f"{error_msg} due to error: {e}") from e
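
One way to address the nitpick is sketched below. `build_error_message`, `start_experiment` and the stage label are hypothetical; the only parts taken from the quoted line are the `due to error` wording and the `from e` chaining.

```python
from typing import Optional


def build_error_message(exp_name: Optional[str], stage: str) -> str:
    """Hypothetical helper: keep the message meaningful when exp_name is None."""
    name = exp_name if exp_name is not None else "<unnamed experiment>"
    return f"Experiment {name} failed during {stage}"


def start_experiment(exp_name: Optional[str]) -> None:
    try:
        raise ValueError("attribute is not serializable")  # stand-in failure
    except ValueError as e:
        error_msg = build_error_message(exp_name, stage="flow deserialization")
        # `from e` keeps the original exception available as __cause__.
        raise RuntimeError(f"{error_msg} due to error: {e}") from e
```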

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

scngupta-dsp

Some minor comments. Overall this looks good, Ishant. A few suggestions:

Can we take another look at the Director / Aggregator / Experiment code and the Envoy code to ensure that our changes do not look like patchwork?
We also need to run a mental simulation to ensure that Phase 2 can be introduced seamlessly.
Let us discuss.

openfl/experimental/workflow/component/aggregator/aggregator.py

scngupta-dsp · 2025-05-29T07:01:26Z

openfl/experimental/workflow/component/aggregator/aggregator.py

+    def _restore_instance_snapshot(self) -> None:
+        """Restore instance snapshot if it exists."""
+        if hasattr(self, "instance_snapshot"):
+            self.flow.restore_instance_snapshot(self.flow, list(self.instance_snapshot))


Please check whether we need this hasattr check. FLSpec.restore_instance_snapshot already checks whether there is a backup.

Can your loop be made more readable and efficient by avoiding repeated dictionary lookups? And instead of k and v, can we give the variables more descriptive names?

You're right about renaming k and v for better readability. I'll update them to more meaningful names like collaborator and task_queue.

Regarding the optimization: the loop is already efficient as it avoids unnecessary operations by checking membership in selected_collaborators before putting the task into the queue. The dictionary lookup (self.__collaborator_tasks_queue.items()) is performed only once at the beginning, and each key is accessed only once per iteration. So, performance-wise, the loop is already optimal.
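
For reference, the renamed loop might look roughly like this. It is a standalone sketch, not the aggregator code: `enqueue_for_selected` and its parameters are hypothetical, and the only names taken from this thread are `collaborator`, `task_queue` and `selected_collaborators`.

```python
import queue
from typing import Dict, Iterable


def enqueue_for_selected(
    collaborator_task_queues: Dict[str, "queue.Queue"],
    selected_collaborators: Iterable[str],
    task: object,
) -> None:
    """Put the task only on the queues of the selected collaborators."""
    selected = set(selected_collaborators)  # constant-time membership checks
    # items() yields key and value together, so no second lookup is needed.
    for collaborator, task_queue in collaborator_task_queues.items():
        if collaborator in selected:
            task_queue.put(task)
```

Converting the selection to a set is an optional extra that only matters if `selected_collaborators` is a long list.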

openfl/experimental/workflow/component/director/experiment.py

scngupta-dsp · 2025-05-29T07:16:49Z

openfl/experimental/workflow/component/director/experiment.py

+        self.experiment_status.update_experiment_status(Status.IN_PROGRESS)
+        logger.info(f"New experiment {self.name} for collaborators {self.collaborators}")


Can we make ExperimentStatus a subclass of Experiment, or something more object-oriented / intuitive? Please evaluate.

Will evaluate

openfl/experimental/workflow/component/director/experiment.py

openfl/experimental/workflow/interface/fl_spec.py

openfl/experimental/workflow/protocols/director.proto

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

nguptax1987 · 2025-06-17T11:48:20Z

openfl/experimental/workflow/component/aggregator/aggregator.py

            )

    def all_quit_jobs_sent(self) -> bool:
        """Assert all quit jobs are sent to collaborators."""


The current docstring is understandable but could be clearer and more descriptive, for example: """Check whether quit jobs have been sent to all authorized collaborators."""
Using "Check whether..." instead of "Assert..." reflects that this is a boolean check, not an assert statement.
Mentioning authorized collaborators adds context.

Incorporated.

nguptax1987 · 2025-06-17T12:01:50Z

openfl/experimental/workflow/component/director/experiment.py

+        self._aggregator_grpc_server = None
        self.aggregator = None
        self.updated_flow = None
+        self.experiment_exception_trace = None


experiment_exception_trace can be removed if we are not using it anywhere

nguptax1987 · 2025-06-17T12:05:50Z

openfl/experimental/workflow/interface/fl_spec.py

+                f"\033[91m{exception}\033[0m",
+            )
+        return flspec_obj



Can we use some constants or something more readable for the color coding?

Yes, that makes sense. Similarly, since color constants are used throughout OpenFL, we can create a dedicated utility module for them. This would promote consistency and reusability across the codebase. I suggest we consider implementing this in a future iteration.
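
As a starting point for that future iteration, a constants helper could be as small as the sketch below. The names `RED`, `RESET` and `colorize` are hypothetical; the escape sequences are the ones already used in the snippet under review.

```python
# ANSI escape sequences, as used in the f-string under review.
RED = "\033[91m"   # bright red foreground
RESET = "\033[0m"  # reset all attributes


def colorize(text: str, color: str = RED) -> str:
    """Wrap text in a color code and reset the terminal afterwards."""
    return f"{color}{text}{RESET}"
```

The call site would then read `colorize(str(exception))` instead of embedding the raw escape codes.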

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

ishant162 added 2 commits May 8, 2025 18:21

director and envoy recovery v1

1b78264

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

Merge branch 'securefederatedai:develop' into director_envoy_recovery

11479f7

ishant162 closed this May 9, 2025

ishant162 added 2 commits May 9, 2025 08:22

Merge branch 'securefederatedai:develop' into director_envoy_recovery

6c1794d

optimized director recovery

1ba83ea

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

ishant162 reopened this May 9, 2025

scngupta-dsp requested changes May 9, 2025

incorporated internal review comments

ecec852

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

ishant162 requested a review from Copilot May 9, 2025 08:07

Copilot AI reviewed May 9, 2025

openfl/experimental/workflow/transport/grpc/director_server.py (resolved)

openfl/experimental/workflow/interface/fl_spec.py (outdated, resolved)

openfl/experimental/workflow/component/director/experiment.py (outdated, resolved)

ishant162 added 2 commits May 9, 2025 15:28

Incorporated review comments

140d172

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

optimize code and handle corner cases

88a0c3e

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

ishant162 closed this May 12, 2025

Merge branch 'securefederatedai:develop' into director_envoy_recovery

e43e52d

ishant162 reopened this May 12, 2025

ishant162 requested a review from Copilot May 12, 2025 11:00

Copilot AI reviewed May 12, 2025

optimized code

40e8f3a

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

scngupta-dsp reviewed May 29, 2025

ishant162 added 2 commits June 3, 2025 11:28

Merge branch 'develop' into director_envoy_recovery

3abb84d

Incorporated internal review comments

3e15375

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

nguptax1987 reviewed Jun 17, 2025

Incorporate internal review comments

46c65b1

Signed-off-by: Ishant Thakare <ishantrog752@gmail.com>

ishant162 closed this Sep 8, 2025

Merge branch 'securefederatedai:develop' into director_envoy_recovery

70e883a

ishant162 reopened this Sep 8, 2025

4 participants
