
Conversation

Copilot AI (Contributor) commented on Dec 10, 2025

Description

Red team runs were getting stuck in the "starting" state due to a race condition in the MLflow integration. The update_red_team_run call would fail when executed immediately after create_evaluation_result, because the evaluation result had not yet fully propagated in the backend.

Changes

Added retry logic with exponential backoff to both API calls in _mlflow_integration.py:

  • Initialized a RetryManager in MLflowIntegration.__init__() to handle transient failures
  • Wrapped the create_evaluation_result call with a retry decorator (5 attempts, 2-30s exponential backoff)
  • Wrapped the update_red_team_run call with a retry decorator to handle the race condition
  • Raised error logging from warning to error severity, with a clear message about the stuck state

The exponential backoff naturally resolves the race condition by providing progressive delays (2s, 3s, 4.5s, 6.75s, 10s) for backend propagation. This reuses the existing RetryManager infrastructure already employed throughout the red teaming codebase; a minimal sketch of the pattern is shown below.
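For illustration, here is a self-contained sketch of the retry-with-exponential-backoff pattern described above. It deliberately does not reproduce the actual RetryManager API (which is not shown in this PR); the helper name, the client object, and the method signatures in the usage comment are assumptions that only mirror the names mentioned in the description.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def retry_with_backoff(
    func: Callable[[], T],
    *,
    max_attempts: int = 5,
    initial_delay: float = 2.0,
    backoff_factor: float = 1.5,
    max_delay: float = 30.0,
) -> T:
    """Call ``func``, retrying failed attempts with exponential backoff.

    With the defaults this sleeps roughly 2s, 3s, 4.5s, and 6.75s between
    the five attempts, giving the backend time to propagate state before
    the call is retried.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:  # the real code would catch the specific service/client error
            if attempt == max_attempts:
                logger.error(
                    "Call failed after %d attempts; the red team run may be left "
                    "stuck in the 'starting' state.",
                    max_attempts,
                )
                raise
            logger.warning("Attempt %d failed; retrying in %.2fs", attempt, delay)
            time.sleep(delay)
            delay = min(delay * backoff_factor, max_delay)
    raise AssertionError("unreachable")  # loop always returns or raises


# Hypothetical usage (names mirror the PR description; signatures are assumptions):
# result = retry_with_backoff(lambda: client.create_evaluation_result(payload))
# retry_with_backoff(lambda: client.update_red_team_run(run_id, result_id=result.id))
```

With a 1.5x multiplier starting at 2s, the waits across five attempts total roughly 16s in this sketch, which is the window the backend gets to propagate the evaluation result before the update call is abandoned.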

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • pypi.org
    • Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python3 python3 -m pip install black --quiet (dns block)
  • scanning-api.github.com
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

We have a recurring issue in the mlflow integration in red teaming in azure-ai-evaluation package. The update red team run call fails and then the run gets stuck in the starting state. Find an elegant way to avoid this issue



Copilot AI changed the title from "[WIP] Fix mlflow integration issue in red team run" to "Fix race condition in MLflow integration causing red team runs to stuck in starting state" on Dec 10, 2025
Copilot AI requested a review from slister1001 December 10, 2025 21:11