fix: provider_data_var context leak#5227
Conversation
|
This pull request has merge conflicts that must be resolved before it can be merged. @jaideepr97 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork |
Signed-off-by: Jaideep Rao <jrao@redhat.com>
7fa11d5 to
793ad1f
Compare
Signed-off-by: Jaideep Rao <jrao@redhat.com>
|
LGTM. This was a good catch. Thanks! |
| request happened to spawn them. This inflates trace durations and bundles | ||
| unrelated DB operations under the wrong trace. | ||
| @dataclass | ||
| class RequestContext: |
There was a problem hiding this comment.
hmmmm looking at this, I wonder if this would've been useful to be in the API pkg if used by providers... not something to change in this PR though.
|
@Mergifyio backport release-0.6.x |
☑️ Command disallowed due to command restrictions in the Mergify configuration.Details
|
|
@cdoern please backport this |
|
@Mergifyio backport release-0.6.x |
✅ Backports have been createdDetails
Cherry-pick of 9b86ce8 has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
# What does this PR do? Following PR description is generated using claude: PR #5168 fixed OTel trace context leaking into background workers, but `PROVIDER_DATA_VAR` — the `ContextVar` that carries authenticated user identity — suffers from the same `asyncio.create_task` copy semantics. When a background worker is spawned, it permanently inherits the spawning request's `PROVIDER_DATA_VAR`, causing all subsequent DB writes to be stamped with the wrong user's identity. In multi-tenant deployments with auth enabled, this means: - Chat completions written through the `InferenceStore` write queue get attributed to whichever user's request first triggered worker creation, breaking row-level access control via `AuthorizedSqlStore`. - Responses processed through the `OpenAIResponsesImpl` background worker pool run under the wrong user's identity, affecting status updates, error handling, and stored response ownership. This PR generalizes the OTel-only utilities from #5168 into a unified `RequestContext` that captures **both** the OTel trace context and `PROVIDER_DATA_VAR` together. The three helpers in `core/task.py` are replaced: | Before (#5168) | After (this PR) | |---|---| | `capture_otel_context()` | `capture_request_context()` — snapshots OTel context **and** provider data | | `activate_otel_context(ctx)` | `activate_request_context(ctx)` — restores both per work-item | | `create_task_with_detached_otel_context(coro)` | `create_detached_background_task(coro)` — clears both before task creation | Both `InferenceStore` and `OpenAIResponsesImpl` are updated to capture a `RequestContext` at enqueue time and activate it in the worker loop, ensuring each work-item runs under the correct user identity and trace. Closes #5221 ## Test Plan - **`tests/unit/core/test_task.py`** (10 tests): Verifies `RequestContext` capture/activate semantics, detached task isolation for both OTel and `PROVIDER_DATA_VAR`, caller context restoration, queue-based propagation patterns, and cross-contamination prevention. - **`tests/unit/utils/inference/test_provider_data_leak.py`** (1 test): Reproduces the `InferenceStore` write queue leak end-to-end — two users store completions through the async queue, then verifies each user can only see their own completions via `AuthorizedSqlStore` access policies. This test fails without the fix. - **`tests/unit/providers/agents/builtin/test_responses_background.py`** (6 new tests): - `TestResponsesOtelContextPropagation` (3 tests): Verifies OTel trace attribution through the responses background worker — each response is processed under its originating request's trace, contexts don't leak between items, and error handlers run under the correct trace. - `TestResponsesProviderDataPropagation` (3 tests): Verifies user identity propagation — each response runs as the correct user, identity doesn't leak between queue items, and error-handling DB writes use the correct user. --------- Signed-off-by: Jaideep Rao <jrao@redhat.com> (cherry picked from commit 9b86ce8) # Conflicts: # src/llama_stack/core/task.py # src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py # src/llama_stack/providers/utils/inference/inference_store.py # tests/unit/core/test_task.py # tests/unit/providers/agents/builtin/test_responses_background.py
Backport of commit 9b86ce8 from main to release-0.6.x. PROVIDER_DATA_VAR — the ContextVar that carries authenticated user identity — leaks through asyncio.create_task copy semantics into long-lived background workers. When a background worker is spawned, it permanently inherits the spawning request's PROVIDER_DATA_VAR, causing all subsequent DB writes to be stamped with the wrong user's identity. This introduces a unified RequestContext in core/task.py that captures both OTel trace context and PROVIDER_DATA_VAR together. Background workers in InferenceStore and OpenAIResponsesImpl now capture context at enqueue time and re-activate it per work-item, ensuring each operation runs under the correct user identity and trace. Adapted for release-0.6.x directory structure (meta_reference paths instead of builtin). Signed-off-by: Jaideep Rao <jrao@redhat.com> Made-with: Cursor
What does this PR do?
Following PR description is generated using claude:
PR #5168 fixed OTel trace context leaking into background workers, but
PROVIDER_DATA_VAR— theContextVarthat carries authenticated user identity — suffers from the sameasyncio.create_taskcopy semantics. When a background worker is spawned, it permanently inherits the spawning request'sPROVIDER_DATA_VAR, causing all subsequent DB writes to be stamped with the wrong user's identity. In multi-tenant deployments with auth enabled, this means:InferenceStorewrite queue get attributed to whichever user's request first triggered worker creation, breaking row-level access control viaAuthorizedSqlStore.OpenAIResponsesImplbackground worker pool run under the wrong user's identity, affecting status updates, error handling, and stored response ownership.This PR generalizes the OTel-only utilities from #5168 into a unified
RequestContextthat captures both the OTel trace context andPROVIDER_DATA_VARtogether. The three helpers incore/task.pyare replaced:capture_otel_context()capture_request_context()— snapshots OTel context and provider dataactivate_otel_context(ctx)activate_request_context(ctx)— restores both per work-itemcreate_task_with_detached_otel_context(coro)create_detached_background_task(coro)— clears both before task creationBoth
InferenceStoreandOpenAIResponsesImplare updated to capture aRequestContextat enqueue time and activate it in the worker loop, ensuring each work-item runs under the correct user identity and trace.Closes #5221
Test Plan
tests/unit/core/test_task.py(10 tests): VerifiesRequestContextcapture/activate semantics, detached task isolation for both OTel andPROVIDER_DATA_VAR, caller context restoration, queue-based propagation patterns, and cross-contamination prevention.tests/unit/utils/inference/test_provider_data_leak.py(1 test): Reproduces theInferenceStorewrite queue leak end-to-end — two users store completions through the async queue, then verifies each user can only see their own completions viaAuthorizedSqlStoreaccess policies. This test fails without the fix.tests/unit/providers/agents/builtin/test_responses_background.py(6 new tests):TestResponsesOtelContextPropagation(3 tests): Verifies OTel trace attribution through the responses background worker — each response is processed under its originating request's trace, contexts don't leak between items, and error handlers run under the correct trace.TestResponsesProviderDataPropagation(3 tests): Verifies user identity propagation — each response runs as the correct user, identity doesn't leak between queue items, and error-handling DB writes use the correct user.