Commit 5d63477

Authored by RobuRishabh, mergify[bot], cdoern, github-actions[bot], and franciscojavierarceo

fix: address unresolved review comments from PyPDF File Processor PR#4743 (#5173)
# What does this PR do?

Addresses remaining unresolved review comments from PR #4743 (PyPDF File Processor integration) to ensure the file processing pipeline is consistent, correctly typed, and aligned with API expectations.

**Key changes:**

- **Remove legacy chunking fallback:** Eliminate the `_legacy_chunk_file` method and all fallback paths from `OpenAIVectorStoreMixin`. The system now raises a clear `RuntimeError` if `file_processor_api` is not configured, instead of silently degrading to legacy inline parsing.
- **Wire `file_processor_api` through all vector_io providers:** Add `Api.file_processors` to `optional_api_dependencies` in the registry, pass it through all 12 factory functions, and accept/forward it in all 9 adapter constructors.
- **Make `files_api` required in PyPDF constructors:** Remove the default `None` from both `PyPDFFileProcessorAdapter` and `PyPDFFileProcessor`, and use `deps[Api.files]` (bracket access) in the factory to fail fast if it is somehow missing.
- **Bounded file reads:** Implement chunked reading for direct uploads to cap memory usage, and add a size check on the `file_id` retrieval path against `max_file_size_bytes`.
- **Clear error for missing `file_id`:** Wrap `openai_retrieve_file` in a `try/except` that surfaces a `ValueError("File with id '...' not found")`, with a new test covering this case.
- **Conditional text cleaning:** Make the `.strip()` whitespace-only page filter conditional on the `clean_text` config setting.
- **Dead code cleanup:** Remove the unused `file_processor_api` field from `VectorStoreWithIndex` and the now-unused `make_overlapped_chunks` import from the mixin.

Closes #4743

## Test Plan

### Automated tests

**1. Unit tests (mixin + vector_io)**

```bash
KMP_DUPLICATE_LIB_OK=TRUE uv run --group unit pytest -sv tests/unit/
```

All `test_contextual_retrieval.py` (16 tests) and `test_vector_store_config_registration.py` tests pass; these exercise the refactored `OpenAIVectorStoreMixin`.

**2. PyPDF file processor tests (20/20 pass)**

```bash
uv run --group test pytest -sv tests/integration/file_processors/test_pypdf_processor.py
```

**3. Full integration suite (replay mode)**

```bash
uv run --group test pytest -sv tests/integration/ --stack-config=starter
```

Result: **4 failed, 54 passed, 639 skipped, 1 xfailed**

All 4 failures are pre-existing and unrelated:

- `test_safety_with_image` — Pydantic schema mismatch (`type: 'image'` vs `'image_url'`)
- `test_starter_distribution_config_loads_and_resolves` / `test_postgres_demo_distribution_config_loads` — relative path `FileNotFoundError`
- `test_mcp_tools_list_with_schemas` — no local MCP server (`Connection refused`)

No regressions in vector_io, file_search, or ingestion workflows.

### Manual E2E verification (with starter distro)

```bash
export OLLAMA_URL="http://localhost:11434/v1"
llama stack run starter
```

**1. Verify the route is registered:**

```bash
curl -sS http://localhost:8321/v1/inspect/routes \
  | jq -r '.data[] | select(.route|test("file-processors";"i"))'
```

Expected:

```json
{
  "route": "/v1alpha/file-processors/process",
  "method": "POST",
  "provider_types": [
    "inline::pypdf"
  ]
}
```

**2. Verify the OpenAPI spec contains the endpoint:**

```bash
curl -sS http://localhost:8321/openapi.json | rg -n "/v1alpha/file-processors/process|file-processors" -i
```

**3. Direct file upload:**

```bash
curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file=@/path/to/sample.pdf" | jq
```

Expected: chunks response with `metadata.processor = "pypdf"`.

**4. Via file_id:**

```bash
FILE_ID="$(curl -sS -X POST http://localhost:8321/v1/files \
  -F "file=@/path/to/sample.pdf" -F "purpose=assistants" | jq -r .id)"
curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file_id=${FILE_ID}" | jq
```

Expected: chunks response with `metadata.processor = "pypdf"` and `file_id` in the chunk metadata.
---------

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
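The bounded-read change listed under key changes above can be sketched roughly like this (`read_bounded`, the chunk size, and the 100 MB cap are illustrative names and defaults, not the actual implementation):

```python
import io

DEFAULT_MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # illustrative 100 MB cap
CHUNK_SIZE = 64 * 1024  # read in 64 KiB pieces to bound per-call memory


def read_bounded(stream: io.BufferedIOBase, max_bytes: int = DEFAULT_MAX_FILE_SIZE_BYTES) -> bytes:
    """Read a stream in fixed-size chunks, failing fast once the cap is exceeded."""
    buf = bytearray()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_bytes:
            # Abort before buffering the rest of an oversized upload.
            raise ValueError(f"File exceeds maximum of {max_bytes} bytes")
```

The point of chunked reading is that an oversized upload is rejected after at most `max_bytes + CHUNK_SIZE` bytes are buffered, instead of after the whole file is in memory.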
1 parent 8d9ef64 commit 5d63477

144 files changed: 157208 additions & 38578 deletions


.github/workflows/integration-vector-io-tests.yml (1 addition, 1 deletion)

```diff
@@ -215,7 +215,7 @@ jobs:
           INFINISPAN_PASSWORD: ${{ matrix.vector-io-provider == 'remote::infinispan' && 'password' || '' }}
         run: |
           uv run --no-sync \
-            pytest -sv --stack-config="files=inline::localfs,inference=inline::sentence-transformers?trust_remote_code=true,vector_io=${{ matrix.vector-io-provider }}" \
+            pytest -sv --stack-config="files=inline::localfs,inference=inline::sentence-transformers?trust_remote_code=true,vector_io=${{ matrix.vector-io-provider }},file_processors=inline::pypdf" \
             tests/integration/vector_io
 
       - name: Check Storage and Memory Available After Tests
```

docs/docs/providers/file_processors/inline_pypdf.mdx (0 additions, 1 deletion)

```diff
@@ -16,7 +16,6 @@ PyPDF-based file processor for extracting text content from documents.
 |-------|------|----------|---------|-------------|
 | `default_chunk_size_tokens` | `int` | No | 800 | Default chunk size in tokens when chunking_strategy type is 'auto' |
 | `default_chunk_overlap_tokens` | `int` | No | 400 | Default chunk overlap in tokens when chunking_strategy type is 'auto' |
-| `max_file_size_bytes` | `int` | No | 104857600 | Maximum file size in bytes for uploaded files (default 100MB) |
 | `extract_metadata` | `bool` | No | True | Whether to extract PDF metadata (title, author, etc.) |
 | `clean_text` | `bool` | No | True | Whether to clean extracted text (remove extra whitespace, normalize line breaks) |
```

src/llama_stack/providers/inline/file_processor/pypdf/__init__.py (1 addition, 1 deletion)

```diff
@@ -17,7 +17,7 @@ async def get_provider_impl(config: PyPDFFileProcessorConfig, deps: dict[Api, An
 
     assert isinstance(config, PyPDFFileProcessorConfig), f"Unexpected config type: {type(config)}"
 
-    files_api = deps.get(Api.files)
+    files_api = deps[Api.files]
 
     impl = PyPDFFileProcessorAdapter(config, files_api)
     return impl
```
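The point of switching from `deps.get(Api.files)` to `deps[Api.files]` is that bracket access raises `KeyError` at provider construction time instead of handing `None` downstream. A minimal illustration (this `Api` enum is a stand-in for the real one):

```python
from enum import Enum


class Api(Enum):
    # Stand-in for llama_stack's Api enum; only two members needed here.
    files = "files"
    inference = "inference"


def resolve_files_api(deps: dict):
    # Bracket access fails fast at startup if the dependency is missing,
    # rather than passing `None` into the provider.
    return deps[Api.files]
```

With `deps.get(Api.files)` the factory would silently construct a provider holding `None`, deferring the failure to the first request that touches the Files API.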

src/llama_stack/providers/inline/file_processor/pypdf/adapter.py (1 addition, 1 deletion)

```diff
@@ -16,7 +16,7 @@
 class PyPDFFileProcessorAdapter:
     """Adapter for PyPDF file processor."""
 
-    def __init__(self, config: PyPDFFileProcessorConfig, files_api=None) -> None:
+    def __init__(self, config: PyPDFFileProcessorConfig, files_api) -> None:
         self.config = config
         self.files_api = files_api
         self.processor = PyPDFFileProcessor(config, files_api)
```

src/llama_stack/providers/inline/file_processor/pypdf/config.py (0 additions, 6 deletions)

```diff
@@ -28,12 +28,6 @@ class PyPDFFileProcessorConfig(BaseModel):
         description="Default chunk overlap in tokens when chunking_strategy type is 'auto'",
     )
 
-    max_file_size_bytes: int = Field(
-        default=100 * 1024 * 1024,
-        ge=1,
-        description="Maximum file size in bytes for uploaded files (default 100MB)",
-    )
-
     # PDF-specific options
     extract_metadata: bool = Field(default=True, description="Whether to extract PDF metadata (title, author, etc.)")
```

src/llama_stack/providers/inline/file_processor/pypdf/pypdf.py (73 additions, 31 deletions)

```diff
@@ -5,10 +5,12 @@
 # the root directory of this source tree.
 
 import io
+import mimetypes
 import time
 import uuid
 from typing import Any
 
+import chardet
 from fastapi import HTTPException, UploadFile
 from pypdf import PdfReader
 
@@ -33,7 +35,7 @@
 class PyPDFFileProcessor:
     """PyPDF-based file processor for PDF documents."""
 
-    def __init__(self, config: PyPDFFileProcessorConfig, files_api=None) -> None:
+    def __init__(self, config: PyPDFFileProcessorConfig, files_api) -> None:
         self.config = config
         self.files_api = files_api
 
@@ -44,72 +46,73 @@ async def process_file(
         options: dict[str, Any] | None = None,
         chunking_strategy: VectorStoreChunkingStrategy | None = None,
     ) -> ProcessFileResponse:
-        """Process a PDF file and return chunks."""
+        """Process a file and return chunks. Supports PDF and text-based files."""
 
-        # Validate input
         if not file and not file_id:
             raise ValueError("Either file or file_id must be provided")
         if file and file_id:
             raise ValueError("Cannot provide both file and file_id")
 
         start_time = time.time()
 
-        # Get PDF content
+        # Upload size limits are enforced by the router layer (upload_safety.py).
+        # The provider trusts that `file` has already been bounded-read and
+        # `file_id` references a file accepted by the Files API.
         if file:
-            # Read from uploaded file
             content = await file.read()
-            if len(content) > self.config.max_file_size_bytes:
-                raise ValueError(
-                    f"File size {len(content)} bytes exceeds maximum of {self.config.max_file_size_bytes} bytes"
-                )
             filename = file.filename or f"{uuid.uuid4()}.pdf"
         elif file_id:
-            # Get file from file storage using Files API
-            if not self.files_api:
-                raise ValueError("Files API not available - cannot process file_id")
-
-            # Get file metadata
             file_info = await self.files_api.openai_retrieve_file(RetrieveFileRequest(file_id=file_id))
             filename = file_info.filename
 
-            # Get file content
             content_response = await self.files_api.openai_retrieve_file_content(
                 RetrieveFileContentRequest(file_id=file_id)
             )
             content = content_response.body
 
+        mime_type, _ = mimetypes.guess_type(filename)
+        mime_category = mime_type.split("/")[0] if (mime_type and "/" in mime_type) else None
+
+        if mime_type == "application/pdf":
+            return self._process_pdf(content, filename, file_id, chunking_strategy, start_time)
+        elif mime_category == "text":
+            return self._process_text(content, filename, file_id, chunking_strategy, start_time)
+        else:
+            # Attempt text decoding as a fallback for unknown types
+            log.warning("Unknown mime type, attempting text extraction", mime_type=mime_type, filename=filename)
+            return self._process_text(content, filename, file_id, chunking_strategy, start_time)
+
+    def _process_pdf(
+        self,
+        content: bytes,
+        filename: str,
+        file_id: str | None,
+        chunking_strategy: VectorStoreChunkingStrategy | None,
+        start_time: float,
+    ) -> ProcessFileResponse:
+        """Process a PDF file."""
         pdf_bytes = io.BytesIO(content)
         reader = PdfReader(pdf_bytes)
 
         if reader.is_encrypted:
             raise HTTPException(status_code=422, detail="Password-protected PDFs are not supported")
 
-        # Extract text from PDF
         text_content, failed_pages = self._extract_pdf_text(reader)
 
-        # Clean text if configured
         if self.config.clean_text:
             text_content = self._clean_text(text_content)
 
-        # Extract metadata if configured
         pdf_metadata = {}
         if self.config.extract_metadata:
             pdf_metadata = self._extract_pdf_metadata(reader)
 
         document_id = str(uuid.uuid4())
-
-        # Prepare document metadata (include filename and file_id)
-        document_metadata = {
-            "filename": filename,
-            **pdf_metadata,
-        }
+        document_metadata: dict[str, Any] = {"filename": filename, **pdf_metadata}
         if file_id:
             document_metadata["file_id"] = file_id
 
         processing_time_ms = int((time.time() - start_time) * 1000)
-
-        # Create response metadata
-        response_metadata = {
+        response_metadata: dict[str, Any] = {
             "processor": "pypdf",
             "processing_time_ms": processing_time_ms,
             "page_count": pdf_metadata.get("page_count", 0),
@@ -120,13 +123,48 @@ async def process_file(
         if failed_pages:
             response_metadata["failed_pages"] = failed_pages
 
-        # Handle empty text - return empty chunks with metadata
         if not text_content or not text_content.strip():
             return ProcessFileResponse(chunks=[], metadata=response_metadata)
 
-        # Create chunks for non-empty text
         chunks = self._create_chunks(text_content, document_id, chunking_strategy, document_metadata)
+        return ProcessFileResponse(chunks=chunks, metadata=response_metadata)
+
+    def _process_text(
+        self,
+        content: bytes,
+        filename: str,
+        file_id: str | None,
+        chunking_strategy: VectorStoreChunkingStrategy | None,
+        start_time: float,
+    ) -> ProcessFileResponse:
+        """Process a text-based file (txt, csv, md, etc.)."""
+        detected = chardet.detect(content)
+        encoding = detected["encoding"] or "utf-8"
+        try:
+            text_content = content.decode(encoding)
+        except UnicodeDecodeError:
+            text_content = content.decode("utf-8", errors="replace")
 
+        if self.config.clean_text:
+            text_content = self._clean_text(text_content)
+
+        document_id = str(uuid.uuid4())
+        document_metadata: dict[str, Any] = {"filename": filename}
+        if file_id:
+            document_metadata["file_id"] = file_id
+
+        processing_time_ms = int((time.time() - start_time) * 1000)
+        response_metadata: dict[str, Any] = {
+            "processor": "text",
+            "processing_time_ms": processing_time_ms,
+            "extraction_method": "text",
+            "file_size_bytes": len(content),
+        }
+
+        if not text_content or not text_content.strip():
+            return ProcessFileResponse(chunks=[], metadata=response_metadata)
+
+        chunks = self._create_chunks(text_content, document_id, chunking_strategy, document_metadata)
         return ProcessFileResponse(chunks=chunks, metadata=response_metadata)
 
     def _extract_pdf_text(self, reader: PdfReader) -> tuple[str, list[str]]:
@@ -140,8 +178,9 @@ def _extract_pdf_text(self, reader: PdfReader) -> tuple[str, list[str]]:
             except Exception as e:
                 failed_pages.append(f"page {page_num + 1}: {e}")
                 continue
-            if page_text and page_text.strip():
-                text_parts.append(page_text)
+            if page_text:
+                if not self.config.clean_text or page_text.strip():
+                    text_parts.append(page_text)
 
         return "\n".join(text_parts), failed_pages
 
@@ -209,6 +248,9 @@ def _create_chunks(
         elif chunking_strategy.type == "static":
             chunk_size = chunking_strategy.static.max_chunk_size_tokens
             overlap_size = chunking_strategy.static.chunk_overlap_tokens
+        elif chunking_strategy.type == "contextual":
+            chunk_size = chunking_strategy.contextual.max_chunk_size_tokens
+            overlap_size = chunking_strategy.contextual.chunk_overlap_tokens
         else:
             chunk_size = self.config.default_chunk_size_tokens
             overlap_size = self.config.default_chunk_overlap_tokens
```
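The new dispatch in `process_file` above can be distilled to the following stdlib-only sketch (`classify` and `decode_text` are illustrative helper names, not names from the diff; the real code uses `chardet` to detect the encoding before falling back to UTF-8 with replacement characters):

```python
import mimetypes


def classify(filename: str) -> str:
    """Return 'pdf', 'text', or 'unknown' based on the guessed MIME type."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type == "application/pdf":
        return "pdf"
    if mime_type and mime_type.split("/")[0] == "text":
        return "text"
    return "unknown"


def decode_text(content: bytes, encoding: str = "utf-8") -> str:
    # Mirror the provider's fallback: try the given encoding first, then
    # decode as UTF-8 with replacement characters rather than failing.
    try:
        return content.decode(encoding)
    except UnicodeDecodeError:
        return content.decode("utf-8", errors="replace")
```

Unknown types fall through to text extraction (with a warning) rather than erroring, which matches the `else` branch in the diff.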

src/llama_stack/providers/inline/vector_io/faiss/faiss.py (1 addition, 1 deletion)

```diff
@@ -344,7 +344,7 @@ async def query_hybrid(
 
 
 class FaissVectorIOAdapter(OpenAIVectorStoreMixin, VectorIO, VectorStoresProtocolPrivate):
-    """Vector I/O adapter using FAISS for in-memory vector similarity search."""
+    """VectorIO adapter that uses FAISS for similarity search and vector storage."""
 
     def __init__(
         self,
```

src/llama_stack/providers/remote/vector_io/weaviate/weaviate.py (1 addition, 1 deletion)

```diff
@@ -290,7 +290,7 @@ async def query_hybrid(
 
 
 class WeaviateVectorIOAdapter(OpenAIVectorStoreMixin, VectorIO, VectorStoresProtocolPrivate):
-    """Vector I/O adapter for remote Weaviate instances."""
+    """VectorIO adapter that uses Weaviate for similarity search and vector storage."""
 
     def __init__(
         self,
```
