Skip to content

Conversation

@nikhilsuri-db
Copy link
Contributor

@nikhilsuri-db nikhilsuri-db commented Nov 4, 2025

What type of PR is this?

  • Refactor
  • Feature
  • Bug Fix
  • Other

Description

This PR introduces a circuit breaker pattern to the telemetry system to prevent cascading failures and improve system resilience. The implementation uses the pybreaker library to monitor telemetry request failures and automatically open the circuit when failure rates exceed configurable thresholds, blocking further requests to protect downstream services. When the circuit is open, telemetry requests are temporarily blocked until the system recovers, at which point the circuit transitions to half-open for testing and eventually closes when normal operation resumes. The circuit breaker configuration is centralized with immutable settings, includes comprehensive logging for monitoring, and maintains full backward compatibility with existing telemetry functionality.

How is this tested?

  • Unit tests
  • E2E Tests
  • Manually
  • N/A

📦 Package Versions Verified
databricks-sql-connector: 4.2.1 (your local version)
pybreaker: 1.4.1 ✨ (successfully installed)
dbt-databricks: 1.11.0 ✅
databricks-sqlalchemy: 2.0.8 ✅
databricks-sdk: 0.67.0 ✅
sqlalchemy: 2.0.44 ✅
LLM GENERATED

Related Tickets & Documents

https://docs.google.com/document/d/1ftRvby9bwDZzE3s1tOb4hJ4Pd9USiXskb9cDw-uQNPM/edit?usp=sharing

@github-actions
Copy link

github-actions bot commented Nov 4, 2025

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@github-actions
Copy link

github-actions bot commented Nov 4, 2025

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Signed-off-by: Nikhil Suri <[email protected]>
Copy link
Contributor

@samikshya-db samikshya-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. I have some concerns with the unit/e2e tests added : added comments around it. Please push the PR once all changes are made. Thanks.



@pytest.fixture(autouse=True)
def aggressive_circuit_breaker_config():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spent some time checking the unit tests and this e2e test. This is not adding something that we are not already testing tbh - or am I missing something? In which case we can remove this altogether.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, this E2E test verifies the full integration path from SQL query execution through telemetry generation to circuit breaker enforcement with real Databricks connections. It ensures that when the circuit breaker opens after rate-limit errors, subsequent telemetry requests are actually blocked from hitting Databricks, and that circuit recovery works correctly with real timing.

Can you elaborate why you think the tests are not adding anything to the existing test coverage?

Copy link
Contributor

@samikshya-db samikshya-db Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only telemetry client is being mocked :

[c1] we're intercepting calls before they ever reach the circuit breaker layer. This means none of the responses are occurring naturally. This means we are testing mock behavior.
Circuit breaker transitions are already tested in test_circuit_breaker_manager.py, rate limit and error propogation are also tested in unit tests. I do not see a single use-case not being already tested via unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the unit tests - they test circuit breaker logic in isolation with mock functions (breaker.call(failing_func)), and telemetry client with direct API calls (_send_with_unified_client()). None of them execute real SQL queries (cursor.execute("SELECT 1")) or verify that telemetry is actually generated during query execution. The E2E test is the only place we validate the full journey: Real Query → Telemetry Generation → Circuit Breaker Enforcement. It also does not test that Circuit breaker state persists across multiple real SQL queries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. we already have an e2e for telemetry with real queries test_concurrent_telemetry.py
  2. What state transitions are you talking about other than ones in test_circuit_breaker_manager.py?

I mean, i am ok with adding the e2e - I just don't want any redundant code in our repo - as long that it is not the case - I am good tbh. I have already approved the PR anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could generate a summary of use cases that will not be tested if we remove these test cases

# SCENARIO: You remove test_circuit_breaker.py

# What could break without detection:

1. Bug: Circuit breaker not invoked during SQL query execution
   - test_concurrent_telemetry.py: Won't catch (no circuit breaker testing)
   - test_circuit_breaker_manager.py: Won't catch (mocks, no real queries)
   - Result: Circuit breaker silently broken in production ❌

2. Bug: Circuit breaker opens but telemetry still sent
   - test_concurrent_telemetry.py: Won't catch (no error scenarios)
   - test_circuit_breaker_manager.py: Won't catch (no telemetry integration)
   - Result: Telemetry spam during outages ❌

3. Bug: 500 errors trigger circuit breaker (should only be 429/503)
   - test_concurrent_telemetry.py: Won't catch (no error scenarios)
   - test_circuit_breaker_manager.py: Won't catch (generic error testing)
   - Result: Circuit breaker too aggressive ❌

4. Bug: Circuit breaker config not passed through connection → telemetry
   - test_concurrent_telemetry.py: Won't catch (circuit breaker not tested)
   - test_circuit_breaker_manager.py: Won't catch (no connection flow)
   - Result: Circuit breaker never enabled ❌

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 and 4 is good. Although (mocks, no real queries) and no telemetry integration is ambiguous as the current e2e has mocks too. But 2, 3 still covered in tests (test_request_circuit_breaker_open , test_telemetry_request_error_handling)

@nikhilsuri-db nikhilsuri-db merged commit 73580fe into main Nov 26, 2025
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants