fix: provider_data_var context leak (backport #5227)#5247
Closed
mergify[bot] wants to merge 1 commit intorelease-0.6.xfrom
Closed
fix: provider_data_var context leak (backport #5227)#5247mergify[bot] wants to merge 1 commit intorelease-0.6.xfrom
mergify[bot] wants to merge 1 commit intorelease-0.6.xfrom
Conversation
# What does this PR do? Following PR description is generated using claude: PR #5168 fixed OTel trace context leaking into background workers, but `PROVIDER_DATA_VAR` — the `ContextVar` that carries authenticated user identity — suffers from the same `asyncio.create_task` copy semantics. When a background worker is spawned, it permanently inherits the spawning request's `PROVIDER_DATA_VAR`, causing all subsequent DB writes to be stamped with the wrong user's identity. In multi-tenant deployments with auth enabled, this means: - Chat completions written through the `InferenceStore` write queue get attributed to whichever user's request first triggered worker creation, breaking row-level access control via `AuthorizedSqlStore`. - Responses processed through the `OpenAIResponsesImpl` background worker pool run under the wrong user's identity, affecting status updates, error handling, and stored response ownership. This PR generalizes the OTel-only utilities from #5168 into a unified `RequestContext` that captures **both** the OTel trace context and `PROVIDER_DATA_VAR` together. The three helpers in `core/task.py` are replaced: | Before (#5168) | After (this PR) | |---|---| | `capture_otel_context()` | `capture_request_context()` — snapshots OTel context **and** provider data | | `activate_otel_context(ctx)` | `activate_request_context(ctx)` — restores both per work-item | | `create_task_with_detached_otel_context(coro)` | `create_detached_background_task(coro)` — clears both before task creation | Both `InferenceStore` and `OpenAIResponsesImpl` are updated to capture a `RequestContext` at enqueue time and activate it in the worker loop, ensuring each work-item runs under the correct user identity and trace. Closes #5221 ## Test Plan - **`tests/unit/core/test_task.py`** (10 tests): Verifies `RequestContext` capture/activate semantics, detached task isolation for both OTel and `PROVIDER_DATA_VAR`, caller context restoration, queue-based propagation patterns, and cross-contamination prevention. - **`tests/unit/utils/inference/test_provider_data_leak.py`** (1 test): Reproduces the `InferenceStore` write queue leak end-to-end — two users store completions through the async queue, then verifies each user can only see their own completions via `AuthorizedSqlStore` access policies. This test fails without the fix. - **`tests/unit/providers/agents/builtin/test_responses_background.py`** (6 new tests): - `TestResponsesOtelContextPropagation` (3 tests): Verifies OTel trace attribution through the responses background worker — each response is processed under its originating request's trace, contexts don't leak between items, and error handlers run under the correct trace. - `TestResponsesProviderDataPropagation` (3 tests): Verifies user identity propagation — each response runs as the correct user, identity doesn't leak between queue items, and error-handling DB writes use the correct user. --------- Signed-off-by: Jaideep Rao <jrao@redhat.com> (cherry picked from commit 9b86ce8) # Conflicts: # src/llama_stack/core/task.py # src/llama_stack/providers/inline/agents/meta_reference/responses/openai_responses.py # src/llama_stack/providers/utils/inference/inference_store.py # tests/unit/core/test_task.py # tests/unit/providers/agents/builtin/test_responses_background.py
Contributor
Author
|
Cherry-pick of 9b86ce8 has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
Collaborator
|
@jaideepr97 please fix conflicts or open up a new PR against 0.6 if you don't have permsisions, thanks! |
Contributor
Contributor
|
@leseb |
Collaborator
|
Closing this one in favor of the reference #5250. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Following PR description is generated using claude:
PR #5168 fixed OTel trace context leaking into background workers, but
PROVIDER_DATA_VAR— theContextVarthat carries authenticated user identity — suffers from the sameasyncio.create_taskcopy semantics. When a background worker is spawned, it permanently inherits the spawning request'sPROVIDER_DATA_VAR, causing all subsequent DB writes to be stamped with the wrong user's identity. In multi-tenant deployments with auth enabled, this means:InferenceStorewrite queue get attributed to whichever user's request first triggered worker creation, breaking row-level access control viaAuthorizedSqlStore.OpenAIResponsesImplbackground worker pool run under the wrong user's identity, affecting status updates, error handling, and stored response ownership.This PR generalizes the OTel-only utilities from #5168 into a unified
RequestContextthat captures both the OTel trace context andPROVIDER_DATA_VARtogether. The three helpers incore/task.pyare replaced:capture_otel_context()capture_request_context()— snapshots OTel context and provider dataactivate_otel_context(ctx)activate_request_context(ctx)— restores both per work-itemcreate_task_with_detached_otel_context(coro)create_detached_background_task(coro)— clears both before task creationBoth
InferenceStoreandOpenAIResponsesImplare updated to capture aRequestContextat enqueue time and activate it in the worker loop, ensuring each work-item runs under the correct user identity and trace.Closes #5221
Test Plan
tests/unit/core/test_task.py(10 tests): VerifiesRequestContextcapture/activate semantics, detached task isolation for both OTel andPROVIDER_DATA_VAR, caller context restoration, queue-based propagation patterns, and cross-contamination prevention.tests/unit/utils/inference/test_provider_data_leak.py(1 test): Reproduces theInferenceStorewrite queue leak end-to-end — two users store completions through the async queue, then verifies each user can only see their own completions viaAuthorizedSqlStoreaccess policies. This test fails without the fix.tests/unit/providers/agents/builtin/test_responses_background.py(6 new tests):TestResponsesOtelContextPropagation(3 tests): Verifies OTel trace attribution through the responses background worker — each response is processed under its originating request's trace, contexts don't leak between items, and error handlers run under the correct trace.TestResponsesProviderDataPropagation(3 tests): Verifies user identity propagation — each response runs as the correct user, identity doesn't leak between queue items, and error-handling DB writes use the correct user.This is an automatic backport of pull request fix: provider_data_var context leak #5227 done by Mergify.