Implement LLM Resilience Layer and Async LLM Migration#3549

Closed
addz9015 wants to merge 1 commit into adenhq:main from addz9015:feature/llm-resilience

Conversation

addz9015 commented Feb 4, 2026

Description

This PR introduces a robust resilience layer for LLM interactions within the Hive framework. It implements the Circuit Breaker pattern and enhanced exponential backoff retries to protect the system from LLM provider outages and transient failures. To support these features and ensure non-blocking execution, the entire LLM provider interface and graph execution components have been migrated to be fully asynchronous.

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected - LLM calls are now async)
  • Refactoring (no functional changes)

Related Issues

Fixes #731

Changes Made

  • New Resilience Module: Created core/framework/llm/resilience.py containing CircuitBreaker, RetryHandler, and ResilienceConfig.
  • Async LLM Migration: Converted LLMProvider and all its implementations (LiteLLMProvider, AnthropicProvider, MockLLMProvider) to use async and await.
  • Integrated Resilience: Added a unified _execute_with_resilience wrapper in the base LLMProvider that automatically applies circuit breaking and retries to all completion calls.
  • Test Suite Modernization: Updated existing tests to use pytest-asyncio and AsyncMock, and added comprehensive unit tests for the resilience logic.
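
For reviewers who have not opened the diff, the pattern described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in core/framework/llm/resilience.py: the field names and defaults on `ResilienceConfig`, the `CircuitOpenError` exception, the `before_call`/`record_*` methods, and the injectable `clock` and `sleep` parameters are all assumptions made for the example.

```python
import asyncio
import time
from dataclasses import dataclass
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open (hypothetical name)."""


@dataclass
class ResilienceConfig:
    # Field names and defaults are illustrative assumptions, not the PR's values.
    max_retries: int = 3
    base_delay: float = 0.5
    max_delay: float = 8.0
    failure_threshold: int = 5
    recovery_timeout: float = 30.0


class CircuitBreaker:
    def __init__(self, config, clock=time.monotonic):
        self.config = config
        self.clock = clock  # injectable so tests can control time
        self.state = CircuitState.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def before_call(self):
        """Reject immediately while OPEN; allow a probe once the timeout has passed."""
        if self.state is CircuitState.OPEN:
            if self.clock() - self.opened_at >= self.config.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("circuit is open")

    def record_success(self):
        self.failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failures += 1
        # A failed probe in HALF_OPEN reopens the circuit right away.
        if (self.state is CircuitState.HALF_OPEN
                or self.failures >= self.config.failure_threshold):
            self.state = CircuitState.OPEN
            self.opened_at = self.clock()


async def execute_with_resilience(func, breaker, config, sleep=asyncio.sleep):
    """Await func() with exponential backoff, consulting the breaker before each attempt."""
    last_error = None
    for attempt in range(config.max_retries + 1):
        breaker.before_call()  # CircuitOpenError propagates without being retried
        try:
            result = await func()
        except Exception as exc:
            breaker.record_failure()
            last_error = exc
            if attempt == config.max_retries:
                break
            await sleep(min(config.max_delay, config.base_delay * 2 ** attempt))
        else:
            breaker.record_success()
            return result
    raise last_error
```

In this sketch the breaker is checked before every attempt, so a circuit that opens partway through a retry loop stops the remaining retries instead of continuing to hit a failing provider.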

Testing

I have performed the following verification steps to ensure robustness:

Unit Testing - Resilience Layer (core/framework/llm/test_resilience.py)

  • Retry Logic: Verified success on second attempt, exponential backoff timing, and error raising after exhaustion.
  • Circuit Breaker: Confirmed state transitions from CLOSED to OPEN after failure threshold, and recovery via HALF-OPEN after timeout.
  • Immediate Rejection: Verified that calls fail immediately when the circuit is OPEN, saving resources and latency.
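
A rough idea of how the retry cases above can be exercised with `AsyncMock` is sketched below. The `retry_async` helper is a hypothetical stand-in for the PR's RetryHandler, and the checks are driven with `asyncio.run` so the snippet runs without plugins; in the actual suite they would presumably be pytest-asyncio tests instead.

```python
import asyncio
from unittest.mock import AsyncMock


async def retry_async(func, max_retries=3, base_delay=0.1, sleep=asyncio.sleep):
    """Hypothetical stand-in for the PR's retry handler."""
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_retries:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await sleep(base_delay * 2 ** attempt)


async def check_success_on_second_attempt():
    # First await raises, second await returns "ok".
    llm_call = AsyncMock(side_effect=[TimeoutError("transient"), "ok"])
    fake_sleep = AsyncMock()  # avoids real waiting during the test
    result = await retry_async(llm_call, sleep=fake_sleep)
    assert result == "ok"
    assert llm_call.await_count == 2
    fake_sleep.assert_awaited_once_with(0.1)
    return result


async def check_error_after_exhaustion():
    # Every await raises, so the error surfaces once retries run out.
    llm_call = AsyncMock(side_effect=TimeoutError("down"))
    fake_sleep = AsyncMock()
    try:
        await retry_async(llm_call, max_retries=2, sleep=fake_sleep)
    except TimeoutError:
        pass
    else:
        raise AssertionError("expected TimeoutError")
    assert llm_call.await_count == 3
    # Collect the delays passed to sleep to inspect the backoff schedule.
    return [call.args[0] for call in fake_sleep.await_args_list]


if __name__ == "__main__":
    asyncio.run(check_success_on_second_attempt())
    asyncio.run(check_error_after_exhaustion())
```

Mocking the sleep function keeps the backoff assertions deterministic and fast, since the test inspects the requested delays rather than measuring wall-clock time.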

Integration Testing - Graph & Providers (core/tests/)

  • Provider Consistency: Ran test_litellm_provider.py (all 16 tests) to verify that async migration didn't break basic completion, system prompts, or tool calling.
  • Executor Stability: Ran test_graph_executor.py and test_executor_max_retries.py to ensure the graph engine correctly orchestrates async nodes and edges.
  • Complex Flows: Ran test_pydantic_validation.py and test_fanout.py to confirm that parallel branches and structured data extraction still pass with the new async architecture.

Checklist

  • State Management: Circuit Breaker correctly tracks failures and recovers over time.
  • Async Consistency: All LLM call sites in the core framework have been updated with await.
  • Backward Compatibility: Graph definitions remain unchanged; only the underlying execution engine is updated.
  • Mock Support: MockLLMProvider is updated to support async testing without real API calls.
  • Import Integrity: Fixed all missing ResilienceConfig and os imports discovered during testing.
  • Super Initialization: Ensured all providers call super().__init__ to instantiate resilience components.
  • Test Coverage: Added 6 new unit tests specifically for the resilience module.
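
The "Super Initialization" item matters because the resilience state is created in the base class constructor. A simplified sketch of that relationship is shown below; the class names mirror the ones in this PR, but the bodies, the `_complete` hook, the `attempts` counter, and the `fail_first` parameter are assumptions for illustration only, and backoff is omitted for brevity.

```python
import asyncio
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    def __init__(self, max_retries: int = 2):
        # Resilience state lives on the base class. A subclass that skips
        # super().__init__() would hit AttributeError inside complete().
        self.max_retries = max_retries
        self.attempts = 0

    async def complete(self, prompt: str) -> str:
        """Public entry point: retries the provider-specific _complete()."""
        for attempt in range(self.max_retries + 1):
            self.attempts += 1
            try:
                return await self._complete(prompt)
            except Exception:
                if attempt == self.max_retries:
                    raise
        raise AssertionError("unreachable")

    @abstractmethod
    async def _complete(self, prompt: str) -> str:
        ...


class MockLLMProvider(LLMProvider):
    def __init__(self, response: str = "mock response", fail_first: int = 0):
        super().__init__()  # sets up max_retries and attempts
        self.response = response
        self.fail_first = fail_first  # number of simulated failures before success

    async def _complete(self, prompt: str) -> str:
        if self.fail_first > 0:
            self.fail_first -= 1
            raise ConnectionError("simulated outage")
        return self.response


if __name__ == "__main__":
    provider = MockLLMProvider(fail_first=1)
    print(asyncio.run(provider.complete("hello")))
```

Because callers now receive a coroutine from `complete`, every call site has to `await` it, which is the breaking change noted under Type of Change.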

Screenshots (if applicable)

[Screenshot: 2026-02-05 010800]

github-actions bot commented Feb 4, 2026

PR Closed - Requirements Not Met

This PR has been automatically closed because it doesn't meet the requirements.

PR Author: @addz9015
Found issues: #731 (assignees: none)
Problem: The PR author must be assigned to the linked issue.

To fix:

  1. Assign yourself (@addz9015) to one of the linked issues
  2. Re-open this PR

Exception: To bypass this requirement, you can:

  • Add the micro-fix label or include micro-fix in your PR title for trivial fixes
  • Add the documentation label or include doc/docs in your PR title for documentation changes

Micro-fix requirements (must meet ALL):

| Qualifies | Disqualifies |
| --- | --- |
| < 20 lines changed | Any functional bug fix |
| Typos & Documentation & Linting | Refactoring for "clean code" |
| No logic/API/DB changes | New features (even tiny ones) |

Why is this required? See #472 for details.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: No Retry Logic or Circuit Breakers for LLM Calls

1 participant