
Conversation

Copilot AI (Contributor) commented on Dec 10, 2025

Description

Red team runs were getting stuck in the "starting" state due to a race condition in the MLflow integration. The update_red_team_run call would fail when executed immediately after create_evaluation_result, because the evaluation result had not yet fully propagated in the backend.

Changes

Added retry logic with exponential backoff to both API calls in _mlflow_integration.py:

  • Initialized a RetryManager in MLflowIntegration.__init__() to handle transient failures
  • Wrapped the create_evaluation_result call with a retry decorator (5 attempts, 2-30s exponential backoff)
  • Wrapped the update_red_team_run call with a retry decorator to handle the race condition
  • Raised error logging from warning to error severity, with a clear message about the stuck state

The exponential backoff naturally resolves the race condition by providing progressive delays (2s, 3s, 4.5s, 6.75s, 10s) for backend propagation. This reuses the existing RetryManager infrastructure already employed throughout the red teaming codebase; a minimal sketch of the pattern is shown below.
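For illustration, here is a self-contained sketch of the retry-with-exponential-backoff pattern described above. It deliberately does not reproduce the actual RetryManager API (which is not shown in this PR); the helper name, the client object, and the method signatures in the usage comment are assumptions that only mirror the names mentioned in the description.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def retry_with_backoff(
    func: Callable[[], T],
    *,
    max_attempts: int = 5,
    initial_delay: float = 2.0,
    backoff_factor: float = 1.5,
    max_delay: float = 30.0,
) -> T:
    """Call ``func``, retrying failed attempts with exponential backoff.

    With the defaults this sleeps roughly 2s, 3s, 4.5s, and 6.75s between
    the five attempts, giving the backend time to propagate state before
    the call is retried.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:  # the real code would catch the specific service/client error
            if attempt == max_attempts:
                logger.error(
                    "Call failed after %d attempts; the red team run may be left "
                    "stuck in the 'starting' state.",
                    max_attempts,
                )
                raise
            logger.warning("Attempt %d failed; retrying in %.2fs", attempt, delay)
            time.sleep(delay)
            delay = min(delay * backoff_factor, max_delay)
    raise AssertionError("unreachable")  # loop always returns or raises


# Hypothetical usage (names mirror the PR description; signatures are assumptions):
# result = retry_with_backoff(lambda: client.create_evaluation_result(payload))
# retry_with_backoff(lambda: client.update_red_team_run(run_id, result_id=result.id))
```

With a 1.5x multiplier starting at 2s, the waits across five attempts total roughly 16s in this sketch, which is the window the backend gets to propagate the evaluation result before the update call is abandoned.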

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • pypi.org
    • Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python3 python3 -m pip install black --quiet (dns block)
  • scanning-api.github.com
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

We have a recurring issue in the mlflow integration in red teaming in azure-ai-evaluation package. The update red team run call fails and then the run gets stuck in the starting state. Find an elegant way to avoid this issue



Copilot AI changed the title from "[WIP] Fix mlflow integration issue in red team run" to "Fix race condition in MLflow integration causing red team runs to stuck in starting state" on Dec 10, 2025
Copilot AI requested a review from slister1001 December 10, 2025 21:11