fix(vector_io): wire file_processors provider into vector store file insertion#5339

Merged
franciscojavierarceo merged 4 commits into llamastack:main from alinaryan:fix-vector-store-file-insert
Apr 2, 2026
Conversation


@alinaryan alinaryan commented Mar 27, 2026

When a file is posted to a vector store via POST /v1/vector_stores/{id}/files, the mixin code checks for a file_processor_api to process the file (e.g., using pypdf or docling for PDF
parsing). However, that attribute was never wired up — no vector_io provider received the file_processors dependency from the resolver. So vector store file insertion always fell
through to the legacy PyPDF chunking path, regardless of which file_processors provider was configured.

This PR fixes that by:

  • Adding Api.file_processors to optional_api_dependencies for all vector_io provider specs
  • Threading the dependency through each provider's factory function and constructor to the OpenAIVectorStoreMixin
  • Adding recording/replay support for the file_processors API in the integration test recording system, so that file processor output is captured during record mode and replayed
    deterministically in CI without running the actual processor
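
The wiring in the first two bullets can be sketched like this. The `ProviderSpec` fields and factory signature are simplified stand-ins for the real llama-stack resolver machinery, and `FaissVectorIOImpl` here is a dummy:

```python
# Hedged sketch of the dependency wiring (simplified, assumed names).
from dataclasses import dataclass, field
from enum import Enum


class Api(Enum):
    vector_io = "vector_io"
    file_processors = "file_processors"


@dataclass
class ProviderSpec:
    api: Api
    provider_type: str
    optional_api_dependencies: list = field(default_factory=list)


class FaissVectorIOImpl:
    def __init__(self, config, file_processor_api=None):
        # Forwarded down to the OpenAIVectorStoreMixin attribute.
        self.file_processor_api = file_processor_api


# The fix: every vector_io spec now declares the optional dependency,
# so the resolver includes it in the deps dict when configured.
faiss_spec = ProviderSpec(
    api=Api.vector_io,
    provider_type="inline::faiss",
    optional_api_dependencies=[Api.file_processors],
)


def get_provider_impl(config, deps):
    # Thread the (possibly absent) dependency into the constructor.
    return FaissVectorIOImpl(config, file_processor_api=deps.get(Api.file_processors))
```

Because the dependency is optional, `deps.get(...)` returns None when no file_processors provider is configured, and the mixin falls back to the legacy path.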

Test plan

Manual verification (local, inline::docling)

Save the following config to ~/.llama/distributions/providers-run/config.yaml:

config.yaml
version: 2
apis: [file_processors, files, vector_io, inference]
providers:
  file_processors:
  - provider_id: docling
    provider_type: inline::docling
    config:
      default_chunk_size_tokens: 800
      default_chunk_overlap_tokens: 400
  files:
  - provider_id: localfs
    provider_type: inline::localfs
    config:
      storage_dir: /tmp/llama-test/files
      metadata_store:
        table_name: files_metadata
        backend: sql_default
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      persistence:
        namespace: vector_io::faiss
        backend: kv_default
  inference:
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config:
      trust_remote_code: true
storage:
  backends:
    kv_default:
      type: kv_sqlite
      db_path: /tmp/llama-test/kvstore.db
    sql_default:
      type: sql_sqlite
      db_path: /tmp/llama-test/sql_store.db

Start the server:

LLAMA_STACK_LOGGING=providers=debug llama stack run ~/.llama/distributions/providers-run/config.yaml --port 8321

Upload a PDF, create a vector store, attach the file, and search:

FILE_ID=$(curl -s http://localhost:8321/v1/files \
  -F purpose=assistants \
  -F file=@tests/integration/responses/fixtures/pdfs/llama_stack_and_models.pdf | jq -r '.id')

VS_ID=$(curl -s http://localhost:8321/v1/vector_stores \
  -H "Content-Type: application/json" \
  -d '{"name":"test","metadata":{"embedding_model":"sentence-transformers/nomic-ai/nomic-embed-text-v1.5","embedding_dimension":768}}' | jq -r '.id')

curl -s "http://localhost:8321/v1/vector_stores/$VS_ID/files" \
  -H "Content-Type: application/json" \
  -d "{\"file_id\":\"$FILE_ID\"}"

curl -s "http://localhost:8321/v1/vector_stores/$VS_ID/search" \
  -H "Content-Type: application/json" \
  -d '{"query":"How many experts does Llama 4 Maverick have?","max_num_results":3}' | jq '.data[].content[].text'

Verify:

  • Server logs show "Using FileProcessor API to process file"
  • Search results return markdown-formatted content (docling output) with first result containing "128 experts"

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 27, 2026
…insertion

Vector store file insertion was always using the legacy pypdf chunking
path because the configured file_processors provider was never injected
into vector_io providers. Add Api.file_processors as an optional
dependency and thread it through all provider constructors to the mixin.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
@alinaryan alinaryan force-pushed the fix-vector-store-file-insert branch from a43f5d7 to c6a1502 on March 27, 2026 at 19:20
@github-actions

Recording workflow completed

Providers: gpt, azure

Recordings have been generated and will be committed automatically by the companion workflow.

View workflow run

Fork PR: Recordings will be committed if you have "Allow edits from maintainers" enabled.

Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@github-actions

Recordings committed successfully

Recordings from the integration tests have been committed to this PR.

View commit workflow

@alinaryan alinaryan force-pushed the fix-vector-store-file-insert branch from fe3dcfc to 0f08163 on April 1, 2026 at 19:01
Add monkey-patching for PyPDFFileProcessorAdapter.process_file so that
the file processor output is recorded during record mode and replayed
during replay mode. This avoids running the actual file processor in CI
replay tests, eliminating non-determinism from random UUIDs and
platform-dependent tokenization.

Only intercepts calls with a file_id (internal calls from the vector
store mixin). Direct HTTP uploads to the file-processors endpoint pass
through to the real provider unmodified.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
@alinaryan alinaryan force-pushed the fix-vector-store-file-insert branch from 0f08163 to c2c6cc6 on April 2, 2026 at 15:38
@alinaryan alinaryan marked this pull request as ready for review April 2, 2026 19:29

@franciscojavierarceo franciscojavierarceo left a comment

nice

@franciscojavierarceo franciscojavierarceo added this pull request to the merge queue Apr 2, 2026
Merged via the queue into llamastack:main with commit 725a0c3 Apr 2, 2026
65 checks passed
@franciscojavierarceo franciscojavierarceo deleted the fix-vector-store-file-insert branch April 2, 2026 20:14
2 participants