
fix: address unresolved review comments from PyPDF File Processor PR#4743#5173

Open
RobuRishabh wants to merge 16 commits into llamastack:main from RobuRishabh:RHAIENG-1823-Address-Unresolved-Reviews

Conversation

@RobuRishabh

What does this PR do?

Addresses remaining unresolved review comments from PR #4743 (PyPDF File Processor integration) to ensure the file processing pipeline is consistent, correctly typed, and aligned with API expectations.

Key changes:

  • Remove legacy chunking fallback: Eliminate the _legacy_chunk_file method and all fallback paths from OpenAIVectorStoreMixin. The system now raises a clear RuntimeError if file_processor_api is not configured, instead of silently degrading to legacy inline parsing.
  • Wire file_processor_api through all vector_io providers: Add Api.file_processors to optional_api_dependencies in the registry, pass it through all 12 factory functions, and accept/forward it in all 9 adapter constructors.
  • Make files_api required in PyPDF constructors: Remove the default None from both PyPDFFileProcessorAdapter and PyPDFFileProcessor, and use deps[Api.files] (bracket access) in the factory to fail fast if somehow missing.
  • Bounded file reads: Implement chunked reading for direct uploads to cap memory usage, and add a size check on the file_id retrieval path against max_file_size_bytes.
  • Clear error for missing file_id: Wrap openai_retrieve_file in a try/except that surfaces a ValueError("File with id '...' not found"), with a new test covering this case.
  • Conditional text cleaning: Make the .strip() whitespace-only page filter conditional on the clean_text config setting.
  • Dead code cleanup: Remove unused file_processor_api field from VectorStoreWithIndex and the now-unused make_overlapped_chunks import from the mixin.
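The bounded-read behavior from the list above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: the helper name, constant name, and error message are illustrative; only `max_file_size_bytes` and the chunked-read idea come from the PR description.

```python
# Illustrative sketch of bounded, chunked reading for a direct upload.
# READ_CHUNK_SIZE and the helper shape are assumptions, not the PR's exact code.
READ_CHUNK_SIZE = 64 * 1024  # read in 64KB pieces to keep peak memory low


async def read_bounded(file, max_file_size_bytes: int) -> bytes:
    """Read an async file object in chunks, failing once the size cap is exceeded."""
    parts: list[bytes] = []
    bytes_read = 0
    while True:
        chunk = await file.read(READ_CHUNK_SIZE)
        if not chunk:
            break
        bytes_read += len(chunk)
        if bytes_read > max_file_size_bytes:
            # Fail fast instead of buffering an arbitrarily large upload.
            raise ValueError(f"File exceeds max_file_size_bytes ({max_file_size_bytes})")
        parts.append(chunk)
    return b"".join(parts)
```

The key point is that memory use is bounded by the chunk size plus the accepted total, and oversized uploads are rejected mid-stream rather than after being fully buffered.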

Closes #4743

Test Plan

Automated tests

1. Unit tests (mixin + vector_io)

KMP_DUPLICATE_LIB_OK=TRUE uv run --group unit pytest -sv tests/unit/

All test_contextual_retrieval.py tests (16) and all test_vector_store_config_registration.py tests pass; these exercise the refactored OpenAIVectorStoreMixin.

2. PyPDF file processor tests (20/20 pass)

uv run --group test pytest -sv tests/integration/file_processors/test_pypdf_processor.py

3. Full integration suite (replay mode)

uv run --group test pytest -sv tests/integration/ --stack-config=starter

Result: 4 failed, 54 passed, 639 skipped, 1 xfailed

All 4 failures are pre-existing and unrelated:

  • test_safety_with_image — Pydantic schema mismatch (type: 'image' vs 'image_url')
  • test_starter_distribution_config_loads_and_resolves / test_postgres_demo_distribution_config_loads — relative path FileNotFoundError
  • test_mcp_tools_list_with_schemas — no local MCP server (Connection refused)

No regressions in vector_io, file_search, or ingestion workflows.

Manual E2E verification (with starter distro)

export OLLAMA_URL="http://localhost:11434/v1"
llama stack run starter

1. Verify route is registered:

curl -sS http://localhost:8321/v1/inspect/routes \
  | jq -r '.data[] | select(.route|test("file-processors";"i"))'

Expected:

{
  "route": "/v1alpha/file-processors/process",
  "method": "POST",
  "provider_types": [
    "inline::pypdf"
  ]
}

2. Verify OpenAPI contains the endpoint:

curl -sS http://localhost:8321/openapi.json | rg -n "/v1alpha/file-processors/process|file-processors" -i

3. Direct file upload:

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file=@/path/to/sample.pdf" | jq

Expected: chunks response with metadata.processor = "pypdf".

4. Via file_id:

FILE_ID="$(curl -sS -X POST http://localhost:8321/v1/files \
  -F "file=@/path/to/sample.pdf" -F "purpose=assistants" | jq -r .id)"

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file_id=${FILE_ID}" | jq

Expected: chunks response with metadata.processor = "pypdf" and file_id in chunk metadata.


meta-cla bot commented Mar 17, 2026

Hi @RobuRishabh!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!


meta-cla bot commented Mar 17, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 17, 2026
…lamastack#4743

- Remove legacy chunking fallback and _legacy_chunk_file from vector store mixin;
  raise RuntimeError if FileProcessor API is not configured
- Wire file_processor_api through all vector_io providers (registry, factories,
  adapter constructors)
- Make files_api required in PyPDF adapter and processor constructors
- Implement chunked file reading (64KB) for direct uploads to cap memory usage
- Add size check on file_id retrieval path against max_file_size_bytes
- Wrap openai_retrieve_file in try/except to surface clear ValueError for
  missing file_id, with test coverage
- Make .strip() page filter conditional on clean_text config
- Remove unused file_processor_api field from VectorStoreWithIndex
- Clean up dead imports (make_overlapped_chunks) from mixin
- Fix linter and formatting issues using pre-commit checks
- Fix pypdf processor to handle .txt files

Signed-off-by: roburishabh <roburishabh@outlook.com>
RobuRishabh force-pushed the RHAIENG-1823-Address-Unresolved-Reviews branch from 43b1105 to 1eb5352 on March 17, 2026 17:36
with pytest.raises(ValueError, match="Cannot provide both file and file_id"):
    await processor.process_file(file=upload_file, file_id="test_id")

async def test_file_id_without_files_api(self, processor: PyPDFFileProcessor):
Collaborator


looks like you removed a test here, is that purposeful?

Author


Yes. Following up on this review in #4743 (comment), I needed to make files_api a required parameter in both PyPDFFileProcessor and PyPDFFileProcessorAdapter, so you must now provide it when creating a processor. Since it's always present, the "if there's no files_api" check was pointless, so I removed it and replaced it with a test for what happens when a user gives a file ID that doesn't exist.
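The replacement test could look roughly like this. This is a hedged sketch, not the PR's actual test code: the `process_file_by_id` helper and the mock wiring are assumptions; only the `ValueError("File with id '...' not found")` behavior comes from the PR description.

```python
# Hypothetical shape of the replacement test: a file_id that the Files API
# cannot resolve should surface a clear ValueError, not an opaque failure.
import asyncio
from unittest.mock import AsyncMock

import pytest


async def process_file_by_id(files_api, file_id: str):
    """Illustrative stand-in for the processor's file_id retrieval path."""
    try:
        return await files_api.openai_retrieve_file(file_id)
    except Exception as exc:
        # Translate the underlying lookup failure into a clear, user-facing error.
        raise ValueError(f"File with id '{file_id}' not found") from exc


@pytest.mark.asyncio
async def test_missing_file_id_raises_value_error():
    files_api = AsyncMock()
    files_api.openai_retrieve_file.side_effect = KeyError("no such file")
    with pytest.raises(ValueError, match="not found"):
        await process_file_by_id(files_api, "missing_id")
```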

@RobuRishabh RobuRishabh requested a review from cdoern March 17, 2026 19:52
Collaborator

@cdoern cdoern left a comment


I think you are missing additions to VectorIORouter, so all the tests are failing because the args to the router are mismatched.

def __init__(self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None) -> None:
    super().__init__(inference_api=inference_api, files_api=files_api, kvstore=None)

def __init__(
    self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api=None
Collaborator


Suggested change
self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api=None
self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api: FileProcessor | None

I think

Collaborator


same for all other vector IO providers you added this to

Signed-off-by: roburishabh <roburishabh@outlook.com>
  - Add MIME type parsing safety check to prevent IndexError
  - Document chunked file reading approach and rationale
  - Make file_processors a hard dependency for all vector_io providers
  - Add unit test for missing file_processor_api error handling

Signed-off-by: roburishabh <roburishabh@outlook.com>
@RobuRishabh RobuRishabh requested a review from cdoern March 19, 2026 17:08

mergify bot commented Mar 25, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 25, 2026
Remove duplicate legacy chunking code that was incorrectly merged
alongside the new FileProcessor API path, and fix incomplete
RuntimeError syntax. Also remove unused make_overlapped_chunks import

Signed-off-by: roburishabh <roburishabh@outlook.com>
@mergify mergify bot removed the needs-rebase label Apr 1, 2026
…PI changes

Add missing docstrings to FaissVectorIOAdapter and WeaviateVectorIOAdapter
to fix ruff D101. Replace f-string logging with structured key-value style
in openai_vector_store_mixin. Update test_openai_vector_store_mixin to
implement new abstract methods and use renamed openai_attach_file_to_vector_store
API with proper mock setup.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Made-with: Cursor

mergify bot commented Apr 2, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 2, 2026
@mergify mergify bot removed the needs-rebase label Apr 2, 2026
content_parts = []
bytes_read = 0
while True:
    chunk = await file.read(64 * 1024)  # 64KB chunks for efficient I/O
Collaborator


ooo we should move this magic number somewhere please and make it a variable to explicitly name it

Author


ok, will do it.
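For what it's worth, the rename can be as small as this; the constant name and placement are hypothetical, only the 64KB value comes from the snippet above.

```python
# Hypothetical name for the magic number; module-level so the intent is explicit.
FILE_READ_CHUNK_SIZE_BYTES = 64 * 1024  # 64KB per read balances syscall overhead vs memory

# ...and in the read loop:
# chunk = await file.read(FILE_READ_CHUNK_SIZE_BYTES)
```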

)
return vector_store_file_object

if isinstance(chunking_strategy, VectorStoreChunkingStrategyStatic):
Collaborator


wait why. did we remove this?

Author


The max_chunk_size_tokens and chunk_overlap_tokens variables that were extracted here were only consumed by _legacy_chunk_file(). Since this PR removes the legacy chunking path entirely (see original review comment from #4743), those variables have no consumers left.

The chunking strategy object is now passed directly to the FileProcessor API:

pf_resp = await self.file_processor_api.process_file(
    ProcessFileRequest(file_id=file_id, chunking_strategy=chunking_strategy)
)

The PyPDF processor handles extracting chunk size/overlap internally in its _create_chunks() method.

The only part we keep is the VectorStoreChunkingStrategyContextual check, because it validates that model_id is configured; that's a fail-fast check that still runs in the mixin before handing off to the FileProcessor.
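Internally, that extraction might look something like this. This is a sketch: `chunk_params` is a hypothetical helper, the attribute names follow OpenAI's vector-store chunking-strategy schema rather than this repo's exact types, and the 800/400 fallback mirrors OpenAI's documented "auto" defaults.

```python
# Hypothetical helper showing how a processor can pull chunk parameters off the
# strategy object itself, instead of the mixin pre-extracting them.
def chunk_params(chunking_strategy) -> tuple[int, int]:
    """Return (max_chunk_size_tokens, chunk_overlap_tokens) for a chunking strategy."""
    static = getattr(chunking_strategy, "static", None)
    if static is not None:
        return static.max_chunk_size_tokens, static.chunk_overlap_tokens
    # "auto" strategy: fall back to defaults (800/400 mirror OpenAI's documented values)
    return 800, 400
```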

chunk_attributes["filename"] = file_response.filename
chunk_attributes["file_id"] = file_id

# Try using FileProcessor API if available
Collaborator


why are we forcing the file processor api? shouldn't it be opt-in?

Author


In PR #4743, the reviewer asked that the File Processor API be treated as a required dependency, not an optional one.

vector_store: VectorStore
index: EmbeddingIndex
inference_api: Inference
file_processor_api: Any = None
Collaborator


why remove?

Author


file_processor_api is never read from any VectorStoreWithIndex instance anywhere in the codebase; it's only accessed via self.file_processor_api on the mixin. This was dead code left over from the original #4743 PR, so I removed it.

Signed-off-by: roburishabh <roburishabh@outlook.com>