
fix: address unresolved review comments from PyPDF File Processor PR#4743#5173

Open
RobuRishabh wants to merge 16 commits into llamastack:main from RobuRishabh:RHAIENG-1823-Address-Unresolved-Reviews

Conversation

@RobuRishabh

What does this PR do?

Addresses remaining unresolved review comments from PR #4743 (PyPDF File Processor integration) to ensure the file processing pipeline is consistent, correctly typed, and aligned with API expectations.

Key changes:

  • Remove legacy chunking fallback: Eliminate the _legacy_chunk_file method and all fallback paths from OpenAIVectorStoreMixin. The system now raises a clear RuntimeError if file_processor_api is not configured, instead of silently degrading to legacy inline parsing.
  • Wire file_processor_api through all vector_io providers: Add Api.file_processors to optional_api_dependencies in the registry, pass it through all 12 factory functions, and accept/forward it in all 9 adapter constructors.
  • Make files_api required in PyPDF constructors: Remove the default None from both PyPDFFileProcessorAdapter and PyPDFFileProcessor, and use deps[Api.files] (bracket access) in the factory to fail fast if somehow missing.
  • Bounded file reads: Implement chunked reading for direct uploads to cap memory usage, and add a size check on the file_id retrieval path against max_file_size_bytes.
  • Clear error for missing file_id: Wrap openai_retrieve_file in a try/except that surfaces a ValueError("File with id '...' not found"), with a new test covering this case.
  • Conditional text cleaning: Make the .strip() whitespace-only page filter conditional on the clean_text config setting.
  • Dead code cleanup: Remove unused file_processor_api field from VectorStoreWithIndex and the now-unused make_overlapped_chunks import from the mixin.
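The bounded-read behavior from the list above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: the helper name, constant name, and error message are illustrative; only `max_file_size_bytes` and the chunked-read idea come from the PR description.

```python
# Illustrative sketch of bounded, chunked reading for a direct upload.
# READ_CHUNK_SIZE and the helper shape are assumptions, not the PR's exact code.
READ_CHUNK_SIZE = 64 * 1024  # read in 64KB pieces to keep peak memory low


async def read_bounded(file, max_file_size_bytes: int) -> bytes:
    """Read an async file object in chunks, failing once the size cap is exceeded."""
    parts: list[bytes] = []
    bytes_read = 0
    while True:
        chunk = await file.read(READ_CHUNK_SIZE)
        if not chunk:
            break
        bytes_read += len(chunk)
        if bytes_read > max_file_size_bytes:
            # Fail fast instead of buffering an arbitrarily large upload.
            raise ValueError(f"File exceeds max_file_size_bytes ({max_file_size_bytes})")
        parts.append(chunk)
    return b"".join(parts)
```

The key point is that memory use is bounded by the chunk size plus the accepted total, and oversized uploads are rejected mid-stream rather than after being fully buffered.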

Closes #4743

Test Plan

Automated tests

1. Unit tests (mixin + vector_io)

KMP_DUPLICATE_LIB_OK=TRUE uv run --group unit pytest -sv tests/unit/

All test_contextual_retrieval.py tests (16) and all test_vector_store_config_registration.py tests pass; these exercise the refactored OpenAIVectorStoreMixin.

2. PyPDF file processor tests (20/20 pass)

uv run --group test pytest -sv tests/integration/file_processors/test_pypdf_processor.py

3. Full integration suite (replay mode)

uv run --group test pytest -sv tests/integration/ --stack-config=starter

Result: 4 failed, 54 passed, 639 skipped, 1 xfailed

All 4 failures are pre-existing and unrelated:

  • test_safety_with_image — Pydantic schema mismatch (type: 'image' vs 'image_url')
  • test_starter_distribution_config_loads_and_resolves / test_postgres_demo_distribution_config_loads — relative path FileNotFoundError
  • test_mcp_tools_list_with_schemas — no local MCP server (Connection refused)

No regressions in vector_io, file_search, or ingestion workflows.

Manual E2E verification (with starter distro)

export OLLAMA_URL="http://localhost:11434/v1"
llama stack run starter

1. Verify route is registered:

curl -sS http://localhost:8321/v1/inspect/routes \
  | jq -r '.data[] | select(.route|test("file-processors";"i"))'

Expected:

{
  "route": "/v1alpha/file-processors/process",
  "method": "POST",
  "provider_types": [
    "inline::pypdf"
  ]
}

2. Verify OpenAPI contains the endpoint:

curl -sS http://localhost:8321/openapi.json | rg -n "/v1alpha/file-processors/process|file-processors" -i

3. Direct file upload:

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file=@/path/to/sample.pdf" | jq

Expected: chunks response with metadata.processor = "pypdf".

4. Via file_id:

FILE_ID="$(curl -sS -X POST http://localhost:8321/v1/files \
  -F "file=@/path/to/sample.pdf" -F "purpose=assistants" | jq -r .id)"

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file_id=${FILE_ID}" | jq

Expected: chunks response with metadata.processor = "pypdf" and file_id in chunk metadata.


meta-cla bot commented Mar 17, 2026

Hi @RobuRishabh!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!


meta-cla bot commented Mar 17, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 17, 2026
…lamastack#4743

- Remove legacy chunking fallback and _legacy_chunk_file from vector store mixin;
  raise RuntimeError if FileProcessor API is not configured
- Wire file_processor_api through all vector_io providers (registry, factories,
  adapter constructors)
- Make files_api required in PyPDF adapter and processor constructors
- Implement chunked file reading (64KB) for direct uploads to cap memory usage
- Add size check on file_id retrieval path against max_file_size_bytes
- Wrap openai_retrieve_file in try/except to surface clear ValueError for
  missing file_id, with test coverage
- Make .strip() page filter conditional on clean_text config
- Remove unused file_processor_api field from VectorStoreWithIndex
- Clean up dead imports (make_overlapped_chunks) from mixin
- Fix linter and formatting issues using pre-commit checks
- Fix pypdf processor to handle .txt files

Signed-off-by: roburishabh <roburishabh@outlook.com>
RobuRishabh force-pushed the RHAIENG-1823-Address-Unresolved-Reviews branch from 43b1105 to 1eb5352 on March 17, 2026 17:36
with pytest.raises(ValueError, match="Cannot provide both file and file_id"):
    await processor.process_file(file=upload_file, file_id="test_id")

async def test_file_id_without_files_api(self, processor: PyPDFFileProcessor):
Collaborator


looks like you removed a test here, is that purposeful?

Author


Yes. Following up on this review in #4743 (comment), I needed to make files_api a required parameter in both PyPDFFileProcessor and PyPDFFileProcessorAdapter, so you must now provide it when creating a processor. Since it's always present, the "if there's no files_api" check was pointless, so I removed it and replaced it with a test for what happens when a user gives a file ID that doesn't exist.
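The replacement test could look roughly like this. This is a hedged sketch, not the PR's actual test code: the `process_file_by_id` helper and the mock wiring are assumptions; only the `ValueError("File with id '...' not found")` behavior comes from the PR description.

```python
# Hypothetical shape of the replacement test: a file_id that the Files API
# cannot resolve should surface a clear ValueError, not an opaque failure.
import asyncio
from unittest.mock import AsyncMock

import pytest


async def process_file_by_id(files_api, file_id: str):
    """Illustrative stand-in for the processor's file_id retrieval path."""
    try:
        return await files_api.openai_retrieve_file(file_id)
    except Exception as exc:
        # Translate the underlying lookup failure into a clear, user-facing error.
        raise ValueError(f"File with id '{file_id}' not found") from exc


@pytest.mark.asyncio
async def test_missing_file_id_raises_value_error():
    files_api = AsyncMock()
    files_api.openai_retrieve_file.side_effect = KeyError("no such file")
    with pytest.raises(ValueError, match="not found"):
        await process_file_by_id(files_api, "missing_id")
```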

@RobuRishabh RobuRishabh requested a review from cdoern March 17, 2026 19:52
Collaborator

@cdoern cdoern left a comment


I think you are missing additions to VectorIORouter, so all the tests are failing because the args to the router are mismatched.

def __init__(self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None) -> None:
    super().__init__(inference_api=inference_api, files_api=files_api, kvstore=None)

def __init__(
    self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api=None
Collaborator


Suggested change
self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api=None
self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api: FileProcessor | None

I think

Collaborator


same for all other vector IO providers you added this to

Signed-off-by: roburishabh <roburishabh@outlook.com>
  - Add MIME type parsing safety check to prevent IndexError
  - Document chunked file reading approach and rationale
  - Make file_processors a hard dependency for all vector_io providers
  - Add unit test for missing file_processor_api error handling

Signed-off-by: roburishabh <roburishabh@outlook.com>
@RobuRishabh RobuRishabh requested a review from cdoern March 19, 2026 17:08

mergify bot commented Mar 25, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 25, 2026
Remove duplicate legacy chunking code that was incorrectly merged
alongside the new FileProcessor API path, and fix incomplete
RuntimeError syntax. Also remove unused make_overlapped_chunks import

Signed-off-by: roburishabh <roburishabh@outlook.com>
@mergify mergify bot removed the needs-rebase label Apr 1, 2026
…PI changes

Add missing docstrings to FaissVectorIOAdapter and WeaviateVectorIOAdapter
to fix ruff D101. Replace f-string logging with structured key-value style
in openai_vector_store_mixin. Update test_openai_vector_store_mixin to
implement new abstract methods and use renamed openai_attach_file_to_vector_store
API with proper mock setup.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Made-with: Cursor

mergify bot commented Apr 2, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 2, 2026
@mergify mergify bot removed the needs-rebase label Apr 2, 2026
content_parts = []
bytes_read = 0
while True:
    chunk = await file.read(64 * 1024)  # 64KB chunks for efficient I/O
Collaborator


ooo we should move this magic number somewhere please and make it a variable to explicitly name it

Author


ok, will do it.
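For what it's worth, the rename can be as small as this; the constant name and placement are hypothetical, only the 64KB value comes from the snippet above.

```python
# Hypothetical name for the magic number; module-level so the intent is explicit.
FILE_READ_CHUNK_SIZE_BYTES = 64 * 1024  # 64KB per read balances syscall overhead vs memory

# ...and in the read loop:
# chunk = await file.read(FILE_READ_CHUNK_SIZE_BYTES)
```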

)
return vector_store_file_object

if isinstance(chunking_strategy, VectorStoreChunkingStrategyStatic):
Collaborator


wait why. did we remove this?

Author


The max_chunk_size_tokens and chunk_overlap_tokens variables that were extracted here were only consumed by _legacy_chunk_file(). Since this PR removes the legacy chunking path entirely (see original review comment from #4743), those variables have no consumers left.

The chunking strategy object is now passed directly to the FileProcessor API:

pf_resp = await self.file_processor_api.process_file(
    ProcessFileRequest(file_id=file_id, chunking_strategy=chunking_strategy)
)

The PyPDF processor handles extracting chunk size/overlap internally in its _create_chunks() method.

The only part we keep is the VectorStoreChunkingStrategyContextual check, because it validates that model_id is configured; that's a fail-fast check that still runs in the mixin before handing off to the FileProcessor.
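Internally, that extraction might look something like this. This is a sketch: `chunk_params` is a hypothetical helper, the attribute names follow OpenAI's vector-store chunking-strategy schema rather than this repo's exact types, and the 800/400 fallback mirrors OpenAI's documented "auto" defaults.

```python
# Hypothetical helper showing how a processor can pull chunk parameters off the
# strategy object itself, instead of the mixin pre-extracting them.
def chunk_params(chunking_strategy) -> tuple[int, int]:
    """Return (max_chunk_size_tokens, chunk_overlap_tokens) for a chunking strategy."""
    static = getattr(chunking_strategy, "static", None)
    if static is not None:
        return static.max_chunk_size_tokens, static.chunk_overlap_tokens
    # "auto" strategy: fall back to defaults (800/400 mirror OpenAI's documented values)
    return 800, 400
```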

chunk_attributes["filename"] = file_response.filename
chunk_attributes["file_id"] = file_id

# Try using FileProcessor API if available
Collaborator


why are we forcing the file processor api? shouldn't it be opt-in?

Author


In PR #4743, the reviewer asked that the File Processor API be treated as a required dependency, not an optional one.

vector_store: VectorStore
index: EmbeddingIndex
inference_api: Inference
file_processor_api: Any = None
Collaborator


why remove?

Author


file_processor_api is never read from any VectorStoreWithIndex instance anywhere in the codebase; it's only accessed via self.file_processor_api on the mixin. This was dead code left over from the original #4743 PR, so I removed it.

Signed-off-by: roburishabh <roburishabh@outlook.com>