fix: provider_data_var context leak (backport #5227) by mergify[bot] · Pull Request #5247 · llamastack/llama-stack

mergify · 2026-03-23T13:34:12Z

What does this PR do?

The following PR description was generated using Claude:

PR #5168 fixed OTel trace context leaking into background workers, but `PROVIDER_DATA_VAR` (the `ContextVar` that carries authenticated user identity) suffers from the same `asyncio.create_task` copy semantics. When a background worker is spawned, it permanently inherits the spawning request's `PROVIDER_DATA_VAR`, causing all subsequent DB writes to be stamped with the wrong user's identity. In multi-tenant deployments with auth enabled, this means:

- Chat completions written through the `InferenceStore` write queue get attributed to whichever user's request first triggered worker creation, breaking row-level access control via `AuthorizedSqlStore`.
- Responses processed through the `OpenAIResponsesImpl` background worker pool run under the wrong user's identity, affecting status updates, error handling, and stored response ownership.
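
The copy semantics described above can be reproduced with the standard library alone. The following is a minimal sketch, not code from this PR: `USER_VAR`, `handle_request`, and `worker` are hypothetical stand-ins for `PROVIDER_DATA_VAR`, a request handler, and a store's write-queue worker.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical stand-in for PROVIDER_DATA_VAR (assumption, not the real variable).
USER_VAR = ContextVar("user", default=None)


async def worker(queue: asyncio.Queue, seen: list) -> None:
    # This task's context was copied once, at the moment create_task() ran.
    while True:
        item = await queue.get()
        if item is None:
            break
        seen.append((item, USER_VAR.get()))


async def handle_request(user: str, queue: asyncio.Queue, state: dict, seen: list) -> None:
    USER_VAR.set(user)
    if "worker" not in state:
        # Lazy spawn: the worker snapshots the current context, i.e. this user's.
        state["worker"] = asyncio.create_task(worker(queue, seen))
    await queue.put(f"write-from-{user}")


async def main() -> list:
    queue, state, seen = asyncio.Queue(), {}, []
    # Each request runs in its own task, so each gets its own context copy.
    await asyncio.create_task(handle_request("alice", queue, state, seen))
    await asyncio.create_task(handle_request("bob", queue, state, seen))
    await queue.put(None)  # sentinel: stop the worker
    await state["worker"]
    return seen


seen = asyncio.run(main())
print(seen)
```

Both items are recorded under `alice`, because her request happened to create the worker; bob's write is attributed to the wrong identity.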

This PR generalizes the OTel-only utilities from #5168 into a unified `RequestContext` that captures **both** the OTel trace context and `PROVIDER_DATA_VAR`. The three helpers in `core/task.py` are replaced:

| Before (#5168) | After (this PR) |
|---|---|
| `capture_otel_context()` | `capture_request_context()`: snapshots OTel context **and** provider data |
| `activate_otel_context(ctx)` | `activate_request_context(ctx)`: restores both per work-item |
| `create_task_with_detached_otel_context(coro)` | `create_detached_background_task(coro)`: clears both before task creation |
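
To make the roles of the three helpers concrete, here is a rough sketch of how they might look. It is an assumption-laden illustration rather than the contents of `core/task.py`: the OTel trace context is modeled as a plain `ContextVar` named `TRACE_VAR`, the fields of `RequestContext` are guesses, and implementing `activate_request_context` as a context manager is a choice made here, not something the PR description confirms.

```python
import asyncio
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any

# Stand-ins (assumptions): the real code uses PROVIDER_DATA_VAR and the OTel context API.
PROVIDER_DATA_VAR = contextvars.ContextVar("provider_data", default=None)
TRACE_VAR = contextvars.ContextVar("trace", default=None)


@dataclass(frozen=True)
class RequestContext:
    provider_data: Any
    trace: Any


def capture_request_context() -> RequestContext:
    # Snapshot both values as seen by the current task.
    return RequestContext(PROVIDER_DATA_VAR.get(), TRACE_VAR.get())


@contextmanager
def activate_request_context(ctx: RequestContext):
    # Install the captured values for one work item, then restore the previous ones.
    pd_token = PROVIDER_DATA_VAR.set(ctx.provider_data)
    tr_token = TRACE_VAR.set(ctx.trace)
    try:
        yield
    finally:
        TRACE_VAR.reset(tr_token)
        PROVIDER_DATA_VAR.reset(pd_token)


def create_detached_background_task(coro) -> asyncio.Task:
    # Clear both values so the new task copies an "empty" context,
    # then restore the caller's values.
    pd_token = PROVIDER_DATA_VAR.set(None)
    tr_token = TRACE_VAR.set(None)
    try:
        return asyncio.create_task(coro)
    finally:
        TRACE_VAR.reset(tr_token)
        PROVIDER_DATA_VAR.reset(pd_token)


async def demo():
    PROVIDER_DATA_VAR.set({"user": "alice"})
    TRACE_VAR.set("trace-a")
    captured = capture_request_context()

    async def background():
        before = capture_request_context()
        with activate_request_context(captured):
            during = capture_request_context()
        after = capture_request_context()
        return before, during, after

    task = create_detached_background_task(background())
    caller_after = capture_request_context()  # caller keeps its own identity
    return await task, caller_after


(before, during, after), caller_after = asyncio.run(demo())
print(before, during, after, caller_after, sep="\n")
```

In this sketch the background task starts with nothing inherited, sees alice's values only inside the `with` block, and the caller's own context is untouched after the task is created.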

Both `InferenceStore` and `OpenAIResponsesImpl` are updated to capture a `RequestContext` at enqueue time and activate it in the worker loop, ensuring each work-item runs under the correct user identity and trace.
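
The capture-at-enqueue, activate-in-worker pattern can be sketched as follows. This is a simplified illustration under stated assumptions: `USER_VAR` stands in for `PROVIDER_DATA_VAR`, only the user identity is carried (no trace), and `enqueue` and `worker` are hypothetical names rather than methods of `InferenceStore` or `OpenAIResponsesImpl`.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical stand-in for PROVIDER_DATA_VAR (assumption).
USER_VAR = ContextVar("user", default=None)


async def enqueue(queue: asyncio.Queue, user: str, payload: str) -> None:
    USER_VAR.set(user)  # simulates auth middleware setting the identity
    # Capture at enqueue time: the identity travels with the work item.
    await queue.put((USER_VAR.get(), payload))


async def worker(queue: asyncio.Queue, written: list) -> None:
    while True:
        item = await queue.get()
        if item is None:
            return
        user, payload = item
        token = USER_VAR.set(user)  # activate for this item only
        try:
            written.append((payload, USER_VAR.get()))
        finally:
            USER_VAR.reset(token)  # identity cannot leak into the next item


async def main() -> list:
    queue, written = asyncio.Queue(), []
    # The worker is started before any request, so it inherits no identity.
    worker_task = asyncio.create_task(worker(queue, written))
    await asyncio.create_task(enqueue(queue, "alice", "completion-1"))
    await asyncio.create_task(enqueue(queue, "bob", "completion-2"))
    await queue.put(None)  # sentinel: stop the worker
    await worker_task
    return written


written = asyncio.run(main())
print(written)
```

Each item is now processed under the identity of the request that enqueued it, regardless of which request caused the worker to exist.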

Closes #5221

Test Plan

- `tests/unit/core/test_task.py` (10 tests): Verifies `RequestContext` capture/activate semantics, detached task isolation for both OTel and `PROVIDER_DATA_VAR`, caller context restoration, queue-based propagation patterns, and cross-contamination prevention.
- `tests/unit/utils/inference/test_provider_data_leak.py` (1 test): Reproduces the `InferenceStore` write queue leak end-to-end: two users store completions through the async queue, and the test then verifies that each user can only see their own completions via `AuthorizedSqlStore` access policies. This test fails without the fix.
- `tests/unit/providers/agents/builtin/test_responses_background.py` (6 new tests):
  - `TestResponsesOtelContextPropagation` (3 tests): Verifies OTel trace attribution through the responses background worker: each response is processed under its originating request's trace, contexts don't leak between items, and error handlers run under the correct trace.
  - `TestResponsesProviderDataPropagation` (3 tests): Verifies user identity propagation: each response runs as the correct user, identity doesn't leak between queue items, and error-handling DB writes use the correct user.

This is an automatic backport of pull request #5227 (fix: provider_data_var context leak), done by Mergify.
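
The shape of the end-to-end leak test in the test plan above can be approximated with a toy store. Everything in this sketch is hypothetical: `ToyAuthorizedStore` and `QueuedWriter` are in-memory stand-ins invented for illustration and do not reflect the real `AuthorizedSqlStore` or `InferenceStore` APIs.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical stand-in for PROVIDER_DATA_VAR (assumption).
USER_VAR = ContextVar("user", default=None)


class ToyAuthorizedStore:
    """In-memory stand-in for an owner-filtered store (not the real API)."""

    def __init__(self):
        self.rows = []

    def insert(self, data):
        # Rows are stamped with whichever identity is active at write time.
        self.rows.append({"owner": USER_VAR.get(), "data": data})

    def fetch_all(self):
        me = USER_VAR.get()
        return [r["data"] for r in self.rows if r["owner"] == me]


class QueuedWriter:
    """Toy write queue whose worker is spawned lazily by the first write."""

    def __init__(self, store):
        self.store = store
        self.queue = asyncio.Queue()
        self.worker = None

    async def write(self, data):
        if self.worker is None:
            self.worker = asyncio.create_task(self._run())
        await self.queue.put((USER_VAR.get(), data))  # capture at enqueue

    async def _run(self):
        while True:
            owner, data = await self.queue.get()
            token = USER_VAR.set(owner)  # activate per work item
            try:
                self.store.insert(data)
            finally:
                USER_VAR.reset(token)
                self.queue.task_done()


async def as_user(user, coro_fn, *args):
    # Run coro_fn in its own task with the given identity, like one request.
    async def runner():
        USER_VAR.set(user)
        return await coro_fn(*args)

    return await asyncio.create_task(runner())


async def test_each_user_sees_only_own_completions():
    store = ToyAuthorizedStore()
    writer = QueuedWriter(store)
    await as_user("alice", writer.write, "alice-completion")
    await as_user("bob", writer.write, "bob-completion")
    await writer.queue.join()  # wait until the worker has drained the queue

    async def fetch():
        return store.fetch_all()

    assert await as_user("alice", fetch) == ["alice-completion"]
    assert await as_user("bob", fetch) == ["bob-completion"]
    return store.rows


rows = asyncio.run(test_each_user_sees_only_own_completions())
```

If the worker in this sketch relied on its inherited context instead of activating the captured owner, both rows would be stamped with alice and the assertions would fail, which mirrors the source's note that the real test fails without the fix.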

Signed-off-by: Jaideep Rao <jrao@redhat.com>
(cherry picked from commit 9b86ce8)

Conflicts:
- `src/llama_stack/core/task.py`
- `src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py`
- `src/llama_stack/providers/utils/inference/inference_store.py`
- `tests/unit/core/test_task.py`
- `tests/unit/providers/agents/builtin/test_responses_background.py`

mergify · 2026-03-23T13:34:13Z

Cherry-pick of 9b86ce8 has failed:

On branch mergify/bp/release-0.6.x/pr-5227
Your branch is up to date with 'origin/release-0.6.x'.

You are currently cherry-picking commit 9b86ce80.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   tests/unit/utils/inference/test_provider_data_leak.py

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   src/llama_stack/core/task.py
	both modified:   src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py
	both modified:   src/llama_stack/providers/utils/inference/inference_store.py
	deleted by us:   tests/unit/core/test_task.py
	deleted by us:   tests/unit/providers/agents/builtin/test_responses_background.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

leseb · 2026-03-23T13:36:13Z

@jaideepr97 please fix conflicts or open up a new PR against 0.6 if you don't have permissions, thanks!

jaideepr97 · 2026-03-23T14:27:06Z

@leseb I don't have permissions to push to a mergify branch, so I raised a different PR here: #5250
It builds heavily on top of #5228, though, so we'll want to get that one in first and then I can rebase mine.

jaideepr97 · 2026-03-23T14:52:24Z

@leseb
@iamemilio is ok with just merging that one since it also covers his backport
#5250 (comment)

bbrowning · 2026-04-01T16:42:09Z

Closing this one in favor of the reference #5250.

mergify bot added the conflicts label Mar 23, 2026

mergify bot requested review from ashwinb, bbrowning, cdoern, ehhuang, franciscojavierarceo, leseb, mattf and raghotham as code owners March 23, 2026 13:34

meta-cla bot added the CLA Signed label Mar 23, 2026

mergify bot mentioned this pull request Mar 23, 2026

fix: provider_data_var context leak #5227

Merged

bbrowning closed this Apr 1, 2026

mergify[bot] wants to merge 1 commit into release-0.6.x from mergify/bp/release-0.6.x/pr-5227
