fix: provider_data_var context leak#5227

Merged
cdoern merged 2 commits into llamastack:main from jaideepr97:provider-data-leak
Mar 20, 2026
@jaideepr97
Contributor

@jaideepr97 jaideepr97 commented Mar 20, 2026

What does this PR do?

The following PR description was generated using Claude:

PR #5168 fixed OTel trace context leaking into background workers, but PROVIDER_DATA_VAR — the ContextVar that carries authenticated user identity — suffers from the same asyncio.create_task copy semantics. When a background worker is spawned, it permanently inherits the spawning request's PROVIDER_DATA_VAR, causing all subsequent DB writes to be stamped with the wrong user's identity. In multi-tenant deployments with auth enabled, this means:

  • Chat completions written through the InferenceStore write queue get attributed to whichever user's request first triggered worker creation, breaking row-level access control via AuthorizedSqlStore.
  • Responses processed through the OpenAIResponsesImpl background worker pool run under the wrong user's identity, affecting status updates, error handling, and stored response ownership.
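The copy semantics behind this leak can be reproduced with the standard library alone. In the minimal sketch below, `USER_VAR` is a hypothetical stand-in for `PROVIDER_DATA_VAR`, and the lazily spawned queue `worker` is an illustrative assumption rather than the actual `InferenceStore` code.

```python
import asyncio
from contextvars import ContextVar
from typing import Optional

# Hypothetical stand-in for PROVIDER_DATA_VAR (an assumption for illustration).
USER_VAR: ContextVar[Optional[str]] = ContextVar("user", default=None)


async def worker(queue: asyncio.Queue, seen: list) -> None:
    # The task's context was copied once, when create_task() ran,
    # so USER_VAR here never reflects later requests.
    while True:
        item = await queue.get()
        if item is None:  # sentinel: stop the worker
            break
        seen.append((item, USER_VAR.get()))


async def handle_request(user: str, item: str, queue: asyncio.Queue,
                         state: dict, seen: list) -> None:
    USER_VAR.set(user)
    if state.get("task") is None:
        # Lazily spawned by whichever request happens to arrive first.
        state["task"] = asyncio.create_task(worker(queue, seen))
    await queue.put(item)


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    state: dict = {}
    seen: list = []
    # Each request runs in its own task, much as a web server would arrange.
    await asyncio.create_task(handle_request("alice", "a-1", queue, state, seen))
    await asyncio.create_task(handle_request("bob", "b-1", queue, state, seen))
    await queue.put(None)
    await state["task"]
    return seen


leaked = asyncio.run(demo())
print(leaked)  # [('a-1', 'alice'), ('b-1', 'alice')]
```

Bob's item is attributed to Alice because the worker only ever sees the context snapshot taken when Alice's request created it.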

This PR generalizes the OTel-only utilities from #5168 into a unified RequestContext that captures both the OTel trace context and PROVIDER_DATA_VAR together. The three helpers in core/task.py are replaced:

| Before (#5168) | After (this PR) |
|---|---|
| capture_otel_context() | capture_request_context(): snapshots OTel context and provider data |
| activate_otel_context(ctx) | activate_request_context(ctx): restores both per work-item |
| create_task_with_detached_otel_context(coro) | create_detached_background_task(coro): clears both before task creation |
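One way to get the "clear before task creation" behavior is to reset the variable, create the task, and then restore the caller's value. The sketch below covers only the `ContextVar` half (no OTel), and `USER_VAR` and `detached_task_sketch` are hypothetical names; it is not the actual `create_detached_background_task` implementation.

```python
import asyncio
from contextvars import ContextVar
from typing import Any, Coroutine, Optional

# Hypothetical stand-in for PROVIDER_DATA_VAR (an assumption for illustration).
USER_VAR: ContextVar[Optional[str]] = ContextVar("user", default=None)


def detached_task_sketch(coro: Coroutine[Any, Any, Any]) -> asyncio.Task:
    # Clear the identity so the context copy taken by create_task() holds
    # no user, then restore the caller's value once the task exists.
    token = USER_VAR.set(None)
    try:
        return asyncio.create_task(coro)
    finally:
        USER_VAR.reset(token)


async def report() -> Optional[str]:
    # Runs inside the background task and reports what it can see.
    return USER_VAR.get()


async def demo() -> tuple:
    USER_VAR.set("alice")
    in_task = await detached_task_sketch(report())
    in_caller = USER_VAR.get()
    return in_task, in_caller


detached_result = asyncio.run(demo())
print(detached_result)  # (None, 'alice')
```

On Python 3.11 and later, passing a fresh `contextvars.Context()` through the `context=` parameter of `asyncio.create_task` is another option, although that drops every context variable rather than only the selected ones.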

Both InferenceStore and OpenAIResponsesImpl are updated to capture a RequestContext at enqueue time and activate it in the worker loop, ensuring each work-item runs under the correct user identity and trace.
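The capture-at-enqueue, activate-per-item pattern can be sketched as follows. `RequestContextSketch`, `capture`, `activate`, and `USER_VAR` are hypothetical names; the real `RequestContext` also carries the OTel context, and whether `activate_request_context` is a context manager is an assumption here.

```python
import asyncio
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Iterator, Optional

# Hypothetical stand-in for PROVIDER_DATA_VAR (an assumption for illustration).
USER_VAR: ContextVar[Optional[str]] = ContextVar("user", default=None)


@dataclass
class RequestContextSketch:
    provider_data: Optional[str]  # the real class also holds OTel context


def capture() -> RequestContextSketch:
    # Snapshot the caller's identity at enqueue time.
    return RequestContextSketch(provider_data=USER_VAR.get())


@contextmanager
def activate(ctx: RequestContextSketch) -> Iterator[None]:
    # Apply the snapshot for one work-item, then restore the previous value.
    token = USER_VAR.set(ctx.provider_data)
    try:
        yield
    finally:
        USER_VAR.reset(token)


async def worker(queue: asyncio.Queue, written: list) -> None:
    while True:
        entry = await queue.get()
        if entry is None:  # sentinel: stop the worker
            break
        ctx, row = entry
        with activate(ctx):
            # Stand-in for a DB write stamped with the current user.
            written.append((row, USER_VAR.get()))


async def enqueue(user: str, row: str, queue: asyncio.Queue) -> None:
    USER_VAR.set(user)
    await queue.put((capture(), row))


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    written: list = []
    task = asyncio.create_task(worker(queue, written))
    await asyncio.create_task(enqueue("alice", "a-1", queue))
    await asyncio.create_task(enqueue("bob", "b-1", queue))
    await queue.put(None)
    await task
    return written


attributed = asyncio.run(demo())
print(attributed)  # [('a-1', 'alice'), ('b-1', 'bob')]
```

Carrying the snapshot inside the queue entry keeps the worker itself identity-free, so attribution depends only on who enqueued the item.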

Closes #5221

Test Plan

  • tests/unit/core/test_task.py (10 tests): Verifies RequestContext capture/activate semantics, detached task isolation for both OTel and PROVIDER_DATA_VAR, caller context restoration, queue-based propagation patterns, and cross-contamination prevention.
  • tests/unit/utils/inference/test_provider_data_leak.py (1 test): Reproduces the InferenceStore write queue leak end-to-end — two users store completions through the async queue, then verifies each user can only see their own completions via AuthorizedSqlStore access policies. This test fails without the fix.
  • tests/unit/providers/agents/builtin/test_responses_background.py (6 new tests):
    • TestResponsesOtelContextPropagation (3 tests): Verifies OTel trace attribution through the responses background worker — each response is processed under its originating request's trace, contexts don't leak between items, and error handlers run under the correct trace.
    • TestResponsesProviderDataPropagation (3 tests): Verifies user identity propagation — each response runs as the correct user, identity doesn't leak between queue items, and error-handling DB writes use the correct user.

@meta-cla meta-cla bot added the CLA Signed label Mar 20, 2026
@mergify
Contributor

mergify bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @jaideepr97 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 20, 2026
Signed-off-by: Jaideep Rao <jrao@redhat.com>
@mergify mergify bot removed the needs-rebase label Mar 20, 2026
Signed-off-by: Jaideep Rao <jrao@redhat.com>
@iamemilio
Contributor

LGTM. This was a good catch. Thanks!

request happened to spawn them. This inflates trace durations and bundles
unrelated DB operations under the wrong trace.
@dataclass
class RequestContext:
Collaborator

hmmmm looking at this, I wonder if this would've been useful to be in the API pkg if used by providers... not something to change in this PR though.

@cdoern cdoern merged commit 9b86ce8 into llamastack:main Mar 20, 2026
73 checks passed
@jaideepr97
Contributor Author

@Mergifyio backport release-0.6.x

@mergify
Contributor

mergify bot commented Mar 20, 2026

backport release-0.6.x

☑️ Command disallowed due to command restrictions in the Mergify configuration.

Details
  • sender-permission >= write

@jaideepr97
Contributor Author

@cdoern please backport this

@leseb
Collaborator

leseb commented Mar 23, 2026

@Mergifyio backport release-0.6.x

@mergify
Contributor

mergify bot commented Mar 23, 2026

backport release-0.6.x

✅ Backports have been created

Details

Cherry-pick of 9b86ce8 has failed:

On branch mergify/bp/release-0.6.x/pr-5227
Your branch is up to date with 'origin/release-0.6.x'.

You are currently cherry-picking commit 9b86ce80.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   tests/unit/utils/inference/test_provider_data_leak.py

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   src/llama_stack/core/task.py
	both modified:   src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py
	both modified:   src/llama_stack/providers/utils/inference/inference_store.py
	deleted by us:   tests/unit/core/test_task.py
	deleted by us:   tests/unit/providers/agents/builtin/test_responses_background.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

mergify bot pushed a commit that referenced this pull request Mar 23, 2026
Signed-off-by: Jaideep Rao <jrao@redhat.com>
(cherry picked from commit 9b86ce8)

# Conflicts:
#	src/llama_stack/core/task.py
#	src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py
#	src/llama_stack/providers/utils/inference/inference_store.py
#	tests/unit/core/test_task.py
#	tests/unit/providers/agents/builtin/test_responses_background.py
jaideepr97 added a commit to jaideepr97/llama-stack that referenced this pull request Mar 23, 2026
Backport of commit 9b86ce8 from main to release-0.6.x.

PROVIDER_DATA_VAR — the ContextVar that carries authenticated user
identity — leaks through asyncio.create_task copy semantics into
long-lived background workers. When a background worker is spawned, it
permanently inherits the spawning request's PROVIDER_DATA_VAR, causing
all subsequent DB writes to be stamped with the wrong user's identity.

This introduces a unified RequestContext in core/task.py that captures
both OTel trace context and PROVIDER_DATA_VAR together. Background
workers in InferenceStore and OpenAIResponsesImpl now capture context at
enqueue time and re-activate it per work-item, ensuring each operation
runs under the correct user identity and trace.

Adapted for release-0.6.x directory structure (meta_reference paths
instead of builtin).

Signed-off-by: Jaideep Rao <jrao@redhat.com>
Made-with: Cursor
Labels

CLA Signed This label is managed by the Meta Open Source bot.

Development

Successfully merging this pull request may close these issues.

PROVIDER_DATA_VAR context leak in asyncio.create_task

4 participants