Implement LLM Resilience (Circuit Breaker & Retries) and migrate to a…#3548

Closed
addz9015 wants to merge 1 commit into adenhq:main from addz9015:feature/llm-resilience

Conversation

@addz9015 addz9015 commented Feb 4, 2026

Description

This PR introduces a robust resilience layer for LLM interactions within the Hive framework. It implements the Circuit Breaker pattern and enhanced exponential backoff retries to protect the system from LLM provider outages and transient failures. To support these features and ensure non-blocking execution, the entire LLM provider interface and graph execution components have been migrated to be fully asynchronous.

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected - LLM calls are now async)
  • Refactoring (no functional changes)

Related Issues

Fixes #731

Changes Made

  • New Resilience Module: Created core/framework/llm/resilience.py containing CircuitBreaker, RetryHandler, and ResilienceConfig.
  • Async LLM Migration: Converted LLMProvider and all its implementations (LiteLLMProvider, AnthropicProvider, MockLLMProvider) to use async and await.
  • Integrated Resilience: Added a unified _execute_with_resilience wrapper in the base LLMProvider that automatically applies circuit breaking and retries to all completion calls.
  • Test Suite Modernization: Updated existing tests to use pytest-asyncio and AsyncMock, and added comprehensive unit tests for the resilience logic.
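
For readers unfamiliar with the retry half of this change, here is a minimal sketch of async retries with exponential backoff. It is not the code in core/framework/llm/resilience.py: `RetrySketchConfig`, `backoff_delay`, and `retry_async` are hypothetical names, and the defaults (3 attempts, 0.5 s base delay, 8 s cap) are assumptions chosen for illustration.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


@dataclass
class RetrySketchConfig:
    """Hypothetical settings; the PR's ResilienceConfig fields are not shown here."""
    max_attempts: int = 3
    base_delay: float = 0.5   # seconds to wait before the first retry
    max_delay: float = 8.0    # upper bound on the un-jittered delay
    jitter: float = 0.0       # extra random fraction of the delay (0 disables it)


def backoff_delay(attempt: int, cfg: RetrySketchConfig) -> float:
    """Delay after failed attempt `attempt` (1-based): base * 2**(attempt - 1), capped."""
    delay = min(cfg.base_delay * 2 ** (attempt - 1), cfg.max_delay)
    return delay + random.uniform(0, cfg.jitter * delay)


async def retry_async(
    call: Callable[[], Awaitable[T]],
    cfg: RetrySketchConfig,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await `call` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(1, cfg.max_attempts + 1):
        try:
            return await call()
        except Exception:
            if attempt == cfg.max_attempts:
                raise  # exhausted: surface the last error to the caller
            await sleep(backoff_delay(attempt, cfg))
    raise RuntimeError("max_attempts must be >= 1")
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets a test substitute a fake coroutine and check the backoff sequence without waiting in real time.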

Testing

I have performed the following verification steps to ensure robustness:

Unit Testing - Resilience Layer (core/framework/llm/test_resilience.py)

  • Retry Logic: Verified success on second attempt, exponential backoff timing, and error raising after exhaustion.
  • Circuit Breaker: Confirmed state transitions from CLOSED to OPEN after failure threshold, and recovery via HALF-OPEN after timeout.
  • Immediate Rejection: Verified that calls fail immediately when the circuit is OPEN, without reaching the provider, which avoids wasted resources and added latency.
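
The state transitions exercised by these tests can be sketched as a small state machine. This is an illustration under assumptions, not the PR's CircuitBreaker: the class name `CircuitBreakerSketch`, the `CircuitOpenError` exception, the injectable `clock`, and the default threshold and timeout are all hypothetical.

```python
import time
from enum import Enum
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the provider."""


class CircuitBreakerSketch:
    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self._state = State.CLOSED

    @property
    def state(self) -> State:
        # OPEN turns into HALF_OPEN lazily once the recovery timeout has elapsed.
        if (self._state is State.OPEN
                and self._clock() - self._opened_at >= self.recovery_timeout):
            self._state = State.HALF_OPEN
        return self._state

    async def call(self, func: Callable[[], Awaitable[T]]) -> T:
        if self.state is State.OPEN:
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = await func()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed probe in HALF_OPEN reopens the circuit immediately.
        if self._state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self._state = State.OPEN
            self._opened_at = self._clock()

    def _record_success(self) -> None:
        self._failures = 0
        self._state = State.CLOSED
```

In this sketch a single successful call in HALF_OPEN closes the circuit again; a real implementation might require several successful probes or limit concurrent probes, which the PR description does not specify.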

Integration Testing - Graph & Providers (core/tests/)

  • Provider Consistency: Ran test_litellm_provider.py (all 16 tests) to verify that async migration didn't break basic completion, system prompts, or tool calling.
  • Executor Stability: Ran test_graph_executor.py and test_executor_max_retries.py to ensure the graph engine correctly orchestrates async nodes and edges.
  • Complex Flows: Ran test_pydantic_validation.py and test_fanout.py to confirm that parallel branches and structured data extraction still work correctly with the new async architecture.

Checklist

  • State Management: Circuit Breaker correctly tracks failures and recovers over time.
  • Async Consistency: All LLM call sites in the core framework have been updated with await.
  • Backward Compatibility: Graph definitions remain unchanged; only the underlying execution engine is updated.
  • Mock Support: MockLLMProvider is updated to support async testing without real API calls.
  • Import Integrity: Fixed all missing ResilienceConfig and os imports discovered during testing.
  • Super Initialization: Ensured all providers call super().__init__ to instantiate resilience components.
  • Test Coverage: Added 6 new unit tests specifically for the resilience module.
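
As an illustration of async testing without real API calls, the sketch below uses `unittest.mock.AsyncMock` from the standard library to simulate one transient failure followed by a success. The `complete` method name and the `complete_with_one_retry` helper are assumptions made for this example; they are not taken from the PR's LLMProvider or MockLLMProvider.

```python
import asyncio
from unittest.mock import AsyncMock


async def complete_with_one_retry(provider, prompt: str) -> str:
    """Hypothetical call site: awaits the provider and retries once on failure."""
    try:
        return await provider.complete(prompt)
    except ConnectionError:
        return await provider.complete(prompt)


def run_example() -> str:
    # AsyncMock makes `complete` awaitable. With a list as side_effect, each call
    # consumes one item: an exception instance is raised, a plain value is returned.
    provider = AsyncMock()
    provider.complete.side_effect = [ConnectionError("transient"), "hello"]
    result = asyncio.run(complete_with_one_retry(provider, "hi"))
    assert provider.complete.await_count == 2  # one failed await, one successful
    return result
```

Under pytest-asyncio the same check would typically be written as an `async def` test that awaits the helper directly instead of calling `asyncio.run`.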

Screenshots (if applicable)

[Image: Screenshot 2026-02-05 010800]

github-actions bot commented Feb 4, 2026

PR Closed - Requirements Not Met

This PR has been automatically closed because it doesn't meet the requirements.

Missing: No linked issue found.

To fix:

  1. Create or find an existing issue for this work
  2. Assign yourself to the issue
  3. Re-open this PR and add Fixes #123 in the description

Exception: To bypass this requirement, you can:

  • Add the micro-fix label or include micro-fix in your PR title for trivial fixes
  • Add the documentation label or include doc/docs in your PR title for documentation changes

Micro-fix requirements (must meet ALL):

| Qualifies | Disqualifies |
| --- | --- |
| < 20 lines changed | Any functional bug fix |
| Typos & Documentation & Linting | Refactoring for "clean code" |
| No logic/API/DB changes | New features (even tiny ones) |

Why is this required? See #472 for details.

@github-actions github-actions bot closed this Feb 4, 2026
@addz9015 addz9015 deleted the feature/llm-resilience branch February 4, 2026 20:10
@addz9015 addz9015 restored the feature/llm-resilience branch February 4, 2026 20:14
Development

Successfully merging this pull request may close these issues.

[Bug]: No Retry Logic or Circuit Breakers for LLM Calls
