
fix: provider_data_var context leak (backport #5227) #5247

Closed
mergify[bot] wants to merge 1 commit into release-0.6.x from mergify/bp/release-0.6.x/pr-5227

mergify bot commented Mar 23, 2026

What does this PR do?

The following PR description was generated using Claude:

PR #5168 fixed OTel trace context leaking into background workers, but PROVIDER_DATA_VAR — the ContextVar that carries authenticated user identity — suffers from the same asyncio.create_task copy semantics. When a background worker is spawned, it permanently inherits the spawning request's PROVIDER_DATA_VAR, causing all subsequent DB writes to be stamped with the wrong user's identity. In multi-tenant deployments with auth enabled, this means:

  • Chat completions written through the InferenceStore write queue get attributed to whichever user's request first triggered worker creation, breaking row-level access control via AuthorizedSqlStore.
  • Responses processed through the OpenAIResponsesImpl background worker pool run under the wrong user's identity, affecting status updates, error handling, and stored response ownership.
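The copy semantics behind the leak can be reproduced with the standard library alone. The sketch below is illustrative only and is not taken from the llama_stack code: `USER_VAR`, `worker`, and `handle_request` are hypothetical stand-ins for `PROVIDER_DATA_VAR`, a write-queue worker, and a request handler that lazily spawns that worker.

```python
import asyncio
from contextvars import ContextVar
from typing import Optional

# Hypothetical stand-in for PROVIDER_DATA_VAR (assumption, not the real variable).
USER_VAR: ContextVar[Optional[str]] = ContextVar("user", default=None)


async def worker(queue: asyncio.Queue, seen: list) -> None:
    # This task's context was copied once, when create_task() was called.
    while True:
        item = await queue.get()
        if item is None:  # sentinel: stop the worker
            break
        seen.append((item, USER_VAR.get()))


async def handle_request(user: str, item: str, queue: asyncio.Queue,
                         state: dict, seen: list) -> None:
    USER_VAR.set(user)
    if state.get("worker") is None:
        # Lazy spawn: the worker inherits a copy of *this* request's context.
        state["worker"] = asyncio.create_task(worker(queue, seen))
    await queue.put(item)


async def demo_leak() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    state: dict = {}
    seen: list = []
    # Each request runs in its own task, so each has its own context.
    await asyncio.create_task(handle_request("alice", "write-1", queue, state, seen))
    await asyncio.create_task(handle_request("bob", "write-2", queue, state, seen))
    await queue.put(None)
    await state["worker"]
    return seen
```

In this sketch both queued items are processed with the identity of the request that first created the worker, which mirrors the misattribution described above.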

This PR generalizes the OTel-only utilities from #5168 into a unified RequestContext that captures both the OTel trace context and PROVIDER_DATA_VAR together. The three helpers in core/task.py are replaced:

| Before (#5168) | After (this PR) | Behavior |
|---|---|---|
| `capture_otel_context()` | `capture_request_context()` | Snapshots OTel context and provider data |
| `activate_otel_context(ctx)` | `activate_request_context(ctx)` | Restores both per work-item |
| `create_task_with_detached_otel_context(coro)` | `create_detached_background_task(coro)` | Clears both before task creation |
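As a rough illustration of how three such helpers could fit together, here is a minimal sketch. It reuses the helper names from the table, but the bodies are assumptions and not the code in `core/task.py`: two plain `ContextVar`s (`TRACE_ID`, `PROVIDER_DATA`) stand in for the OTel context and `PROVIDER_DATA_VAR`, and making `activate_request_context` a context manager is a design guess.

```python
import asyncio
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Any, Coroutine, Iterator, Optional

# Hypothetical stand-ins (assumptions): the real code uses the OTel context API
# and PROVIDER_DATA_VAR rather than these two variables.
TRACE_ID: ContextVar[Optional[str]] = ContextVar("trace_id", default=None)
PROVIDER_DATA: ContextVar[Optional[dict]] = ContextVar("provider_data", default=None)


@dataclass(frozen=True)
class RequestContext:
    trace_id: Optional[str]
    provider_data: Optional[dict]


def capture_request_context() -> RequestContext:
    # Snapshot both values from the caller's current context.
    return RequestContext(TRACE_ID.get(), PROVIDER_DATA.get())


@contextmanager
def activate_request_context(ctx: RequestContext) -> Iterator[None]:
    # Install the captured values for one work-item, then restore the previous ones.
    trace_token = TRACE_ID.set(ctx.trace_id)
    data_token = PROVIDER_DATA.set(ctx.provider_data)
    try:
        yield
    finally:
        PROVIDER_DATA.reset(data_token)
        TRACE_ID.reset(trace_token)


def create_detached_background_task(coro: Coroutine[Any, Any, Any]) -> asyncio.Task:
    # Clear both values so the context copied by create_task() carries no
    # request state, then restore the caller's values.
    trace_token = TRACE_ID.set(None)
    data_token = PROVIDER_DATA.set(None)
    try:
        return asyncio.create_task(coro)
    finally:
        PROVIDER_DATA.reset(data_token)
        TRACE_ID.reset(trace_token)
```

Resetting the tokens in a `finally` block keeps the caller's own context intact after the task is spawned, which corresponds to the "caller context restoration" case listed in the test plan.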

Both InferenceStore and OpenAIResponsesImpl are updated to capture a RequestContext at enqueue time and activate it in the worker loop, ensuring each work-item runs under the correct user identity and trace.
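The capture-at-enqueue, activate-in-worker pattern can be sketched as follows. This is a simplified, self-contained illustration and not the `InferenceStore` or `OpenAIResponsesImpl` code: `USER_VAR`, `enqueue`, and `worker` are hypothetical names, and a single `ContextVar` stands in for the full `RequestContext`.

```python
import asyncio
from contextvars import ContextVar
from typing import Optional

# Hypothetical stand-in for PROVIDER_DATA_VAR (assumption).
USER_VAR: ContextVar[Optional[str]] = ContextVar("user", default=None)


async def worker(queue: asyncio.Queue, written: list) -> None:
    while True:
        entry = await queue.get()
        if entry is None:  # sentinel: stop the worker
            break
        item, captured_user = entry
        # Activate the identity captured at enqueue time for this item only.
        token = USER_VAR.set(captured_user)
        try:
            written.append((item, USER_VAR.get()))
        finally:
            USER_VAR.reset(token)


async def enqueue(user: str, item: str, queue: asyncio.Queue) -> None:
    USER_VAR.set(user)
    # Capture the caller's identity alongside the work-item.
    await queue.put((item, USER_VAR.get()))


async def demo_fixed() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    written: list = []
    # Spawn the worker with the identity cleared so it inherits no request state.
    token = USER_VAR.set(None)
    try:
        task = asyncio.create_task(worker(queue, written))
    finally:
        USER_VAR.reset(token)
    await asyncio.create_task(enqueue("alice", "write-1", queue))
    await asyncio.create_task(enqueue("bob", "write-2", queue))
    await queue.put(None)
    await task
    return written
```

Because the identity travels with each queue entry instead of with the worker task, the order in which requests arrive no longer determines who a write is attributed to.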

Closes #5221

Test Plan

  • tests/unit/core/test_task.py (10 tests): Verifies RequestContext capture/activate semantics, detached task isolation for both OTel and PROVIDER_DATA_VAR, caller context restoration, queue-based propagation patterns, and cross-contamination prevention.
  • tests/unit/utils/inference/test_provider_data_leak.py (1 test): Reproduces the InferenceStore write queue leak end-to-end — two users store completions through the async queue, then verifies each user can only see their own completions via AuthorizedSqlStore access policies. This test fails without the fix.
  • tests/unit/providers/agents/builtin/test_responses_background.py (6 new tests):
    • TestResponsesOtelContextPropagation (3 tests): Verifies OTel trace attribution through the responses background worker — each response is processed under its originating request's trace, contexts don't leak between items, and error handlers run under the correct trace.
    • TestResponsesProviderDataPropagation (3 tests): Verifies user identity propagation — each response runs as the correct user, identity doesn't leak between queue items, and error-handling DB writes use the correct user.

This is an automatic backport of pull request #5227 (fix: provider_data_var context leak), done by Mergify.

---------

Signed-off-by: Jaideep Rao <jrao@redhat.com>
(cherry picked from commit 9b86ce8)

# Conflicts:
#	src/llama_stack/core/task.py
#	src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py
#	src/llama_stack/providers/utils/inference/inference_store.py
#	tests/unit/core/test_task.py
#	tests/unit/providers/agents/builtin/test_responses_background.py

mergify bot commented Mar 23, 2026

Cherry-pick of 9b86ce8 has failed:

On branch mergify/bp/release-0.6.x/pr-5227
Your branch is up to date with 'origin/release-0.6.x'.

You are currently cherry-picking commit 9b86ce80.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   tests/unit/utils/inference/test_provider_data_leak.py

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   src/llama_stack/core/task.py
	both modified:   src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py
	both modified:   src/llama_stack/providers/utils/inference/inference_store.py
	deleted by us:   tests/unit/core/test_task.py
	deleted by us:   tests/unit/providers/agents/builtin/test_responses_background.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

mergify bot added the conflicts label Mar 23, 2026
meta-cla bot added the CLA Signed label Mar 23, 2026

leseb commented Mar 23, 2026

@jaideepr97 please fix conflicts or open up a new PR against 0.6 if you don't have permissions, thanks!

@jaideepr97
Contributor

@leseb I don't have permissions to push to a mergify branch, so I raised a different PR here: #5250.
It builds heavily on top of #5228, though, so we'll want to get that one in first and then I can rebase mine.

@jaideepr97
Contributor

@leseb
@iamemilio is ok with just merging that one since it also covers his backport
#5250 (comment)

@bbrowning
Collaborator

Closing this one in favor of the referenced #5250.

bbrowning closed this Apr 1, 2026