Implement LLM Resilience (Circuit Breaker & Retries) and migrate to a…#3548
Closed
addz9015 wants to merge 1 commit into adenhq:main from
PR Closed - Requirements Not Met
This PR has been automatically closed because it doesn't meet the requirements.
Missing: No linked issue found.
To fix:
Exception: To bypass this requirement, you can:
Micro-fix requirements (must meet ALL):
Why is this required? See #472 for details.
Description
This PR introduces a robust resilience layer for LLM interactions within the Hive framework. It implements the Circuit Breaker pattern and enhanced exponential backoff retries to protect the system from LLM provider outages and transient failures. To support these features and ensure non-blocking execution, the entire LLM provider interface and graph execution components have been migrated to be fully asynchronous.
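The combination described above can be illustrated with a minimal sketch. This is not the code from this PR: the class names `ResilienceConfig` and `CircuitBreaker` are borrowed from the description, but every field name, default value, the `CircuitOpenError` exception, and the standalone `execute_with_resilience` function are assumptions made for illustration. It also ignores concurrency (several trial calls may run while HALF-OPEN).

```python
import asyncio
import random
import time
from dataclasses import dataclass


@dataclass
class ResilienceConfig:
    # All field names and defaults are illustrative assumptions.
    failure_threshold: int = 3      # consecutive failures before opening
    recovery_timeout: float = 30.0  # seconds to stay OPEN before a trial call
    max_retries: int = 2            # retries after the first attempt
    base_delay: float = 0.5         # first backoff delay, in seconds
    max_delay: float = 8.0          # cap on the un-jittered backoff delay


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised when a call is rejected while OPEN."""


class CircuitBreaker:
    def __init__(self, config, clock=time.monotonic):
        self.config = config
        self.clock = clock          # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def before_call(self):
        """Reject the call while OPEN; allow a trial call after the timeout."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.config.recovery_timeout:
                raise CircuitOpenError("circuit is OPEN; failing fast")
            self.state = "HALF-OPEN"

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit.
        if self.state == "HALF-OPEN" or self.failures >= self.config.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()


async def execute_with_resilience(call, breaker, config, sleep=asyncio.sleep):
    """Await `call()` with exponential-backoff retries, guarded by the breaker."""
    attempt = 0
    while True:
        breaker.before_call()  # raises CircuitOpenError without calling the provider
        try:
            result = await call()
        except Exception:
            breaker.record_failure()
            if attempt >= config.max_retries:
                raise
            # Delay doubles each attempt, is capped, and gets up to 10% jitter.
            delay = min(config.max_delay, config.base_delay * 2 ** attempt)
            await sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
        else:
            breaker.record_success()
            return result
```

In this sketch the breaker check sits inside the retry loop, so once repeated failures open the circuit, the remaining retries are skipped and the caller sees `CircuitOpenError` immediately. A provider method could then wrap its completion call as `await execute_with_resilience(lambda: client_call(...), breaker, config)`, where `client_call` stands in for whatever coroutine actually contacts the LLM.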
Type of Change
Related Issues
Fixes #731
Changes Made
- Added core/framework/llm/resilience.py containing CircuitBreaker, RetryHandler, and ResilienceConfig.
- Migrated LLMProvider and all its implementations (LiteLLMProvider, AnthropicProvider, MockLLMProvider) to use async and await.
- Added an _execute_with_resilience wrapper in the base LLMProvider that automatically applies circuit breaking and retries to all completion calls.
- Updated the tests to use pytest-asyncio and AsyncMock, and added comprehensive unit tests for the resilience logic.
Testing
I have performed the following verification steps to ensure robustness:
Unit Testing - Resilience Layer (core/framework/llm/test_resilience.py)
- Verified the transition from CLOSED to OPEN after the failure threshold, and recovery via HALF-OPEN after the timeout.
- Verified that calls fail fast while the circuit is OPEN, saving resources and latency.
Integration Testing - Graph & Providers (core/tests/)
- Ran test_litellm_provider.py (all 16 tests) to verify that the async migration didn't break basic completion, system prompts, or tool calling.
- Ran test_graph_executor.py and test_executor_max_retries.py to ensure the graph engine correctly orchestrates async nodes and edges.
- Ran test_pydantic_validation.py and test_fanout.py to confirm that parallel branches and structured data extraction work correctly with the new async architecture.
Checklist
- … await.
- MockLLMProvider is updated to support async testing without real API calls.
- … ResilienceConfig and os imports discovered during testing.
- … super().__init__ to instantiate resilience components.
Screenshots (if applicable)