fix(wren-ai-service): expose batch_size in LitellmEmbedderProvider to cap embedding API call size #2201
Conversation
…to embedding node

When Hamilton's AsyncDriver executes the indexing DAG, it wraps async nodes in asyncio Tasks. Under complex MDL schemas with many relationships, the async chunk node's Task was being passed unawaited to the downstream embedding node instead of the actual dict result, causing the embedder to receive an asyncio Task repr string rather than the document chunks.

This change makes DDLChunker.run() and its helpers synchronous, matching the pattern used by all other indexing pipelines (historical_question, table_description, project_meta). The async machinery in _model_preprocessor was unnecessary since MODEL_PREPROCESSORS is empty by default and all helper operations are CPU-bound string manipulations. Update tests to call chunker.run() synchronously accordingly.

Fixes Canner#2138
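For orientation, a minimal sketch of the synchronous shape this commit describes. The class below is a simplified assumption (the real DDLChunker in wren-ai-service has more configuration and different helper names), not the actual implementation:

```python
from typing import Any


class DDLChunker:
    """Simplified stand-in for the real indexing component."""

    def __init__(self, column_batch_size: int = 50):
        self._column_batch_size = column_batch_size

    def _chunk_model(self, model: dict[str, Any]) -> list[dict[str, Any]]:
        # Pure string/dict manipulation: nothing here awaits I/O,
        # which is why the async machinery could be dropped.
        columns = model.get("columns", [])
        return [
            {
                "model": model.get("name"),
                "columns": columns[i : i + self._column_batch_size],
            }
            for i in range(0, len(columns), self._column_batch_size)
        ]

    def run(self, mdl: dict[str, Any]) -> dict[str, Any]:
        # Synchronous: returns the chunk dict directly, so Hamilton's driver
        # never hands an un-awaited asyncio Task to the embedding node.
        chunks = [c for m in mdl.get("models", []) for c in self._chunk_model(m)]
        return {"documents": chunks}
```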
DDLChunker.run() is now synchronous, so the chunker test cases no longer need pytest.mark.asyncio or async def. Only test_pipeline_run keeps async because it still awaits DBSchema.run.
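A hedged sketch of what the adjusted tests might look like, building on the simplified DDLChunker sketch above; the fixture data and assertions are assumptions rather than the repository's actual test_db_schema.py:

```python
import pytest


def test_ddl_chunker_run_is_synchronous():
    # No pytest.mark.asyncio or async def needed once run() is plain sync.
    chunker = DDLChunker(column_batch_size=2)
    result = chunker.run(
        {"models": [{"name": "orders", "columns": ["id", "amount", "status"]}]}
    )
    assert len(result["documents"]) == 2  # 3 columns split into chunks of 2


@pytest.mark.asyncio
async def test_pipeline_run():
    # Stays async because the pipeline-level test still awaits DBSchema.run.
    ...
```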
… cap embedding API calls (fixes Canner#2031)

The document embedder's batch_size (how many texts are sent per embedding API call) was hardcoded to 32 inside AsyncDocumentEmbedder and not reachable from config.yaml. Users with embedding providers that enforce a lower batch limit (e.g. max 10) had no way to reduce it; setting column_indexing_batch_size in the settings section only controls DDL chunking, not the embedding API batch size.

Add batch_size as an explicit, named parameter to LitellmEmbedderProvider and forward it to get_document_embedder(). Users can now set it per-model in the embedder section of config.yaml:

    type: embedder
    provider: litellm_embedder
    models:
      - model: openai/text-embedding-v4
        alias: default
        batch_size: 10  # cap API call size for providers with batch limits

Also document the option with a commented example in config.qwen3.yaml.

Co-Authored-By: Octopus <liyuan851277048@icloud.com>
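A rough sketch of the wiring this commit message describes; the signatures, the stand-in `fake_embedding_call`, and the return shapes are illustrative assumptions, not the exact code in litellm.py:

```python
import asyncio


async def fake_embedding_call(model: str, batch: list[str]) -> list[list[float]]:
    # Stand-in for the real embedding API call; returns dummy vectors.
    return [[0.0] * 3 for _ in batch]


class AsyncDocumentEmbedder:
    def __init__(self, model: str, batch_size: int = 32, **kwargs):
        self._model = model
        self._batch_size = batch_size  # texts sent per embedding API call

    async def embed(self, texts: list[str]) -> list[list[float]]:
        embeddings: list[list[float]] = []
        # Each slice becomes one API call, so providers that reject large
        # batches (e.g. max 10 inputs per request) stay within their limit.
        for i in range(0, len(texts), self._batch_size):
            embeddings.extend(
                await fake_embedding_call(self._model, texts[i : i + self._batch_size])
            )
        return embeddings


class LitellmEmbedderProvider:
    def __init__(self, model: str, batch_size: int = 32, **kwargs):
        self._model = model
        self._batch_size = batch_size  # now configurable per model in config.yaml

    def get_document_embedder(self) -> AsyncDocumentEmbedder:
        # Forward the configured value instead of relying on a hardcoded 32.
        return AsyncDocumentEmbedder(model=self._model, batch_size=self._batch_size)


provider = LitellmEmbedderProvider(model="openai/text-embedding-v4", batch_size=10)
vectors = asyncio.run(
    provider.get_document_embedder().embed([f"doc {i}" for i in range(25)])
)  # issues 3 API calls of at most 10 texts each
assert len(vectors) == 25
```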
🧹 Nitpick comments (1)
wren-ai-service/src/providers/embedder/litellm.py (1)
175-175: Optional: consider guarding against non-positive `batch_size` values.

If a user mis-configures `batch_size: 0` (or a negative value) in `config.yaml`, `_embed_batch` will raise `ValueError: range() arg 3 must not be zero` or produce an empty `batches` list at call time, rather than failing fast at provider initialization. A small validation here would surface the misconfiguration early with a clearer message.

🛡️ Proposed validation
            timeout: float = 120.0,
            batch_size: int = 32,  # number of documents sent per embedding API call
            **kwargs,
        ):
    +       if batch_size <= 0:
    +           raise ValueError(
    +               f"batch_size must be a positive integer, got {batch_size}"
    +           )
            self._api_key = os.getenv(api_key_name) if api_key_name else None

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wren-ai-service/src/providers/embedder/litellm.py` at line 175, Validate the batch_size parameter at provider initialization (e.g., in the Litellm embedder's __init__ or factory) to ensure batch_size is a positive integer; if batch_size <= 0 raise a clear ValueError explaining the misconfiguration (mention config.yaml) so the issue surfaces at startup instead of inside _embed_batch where range() or empty batches would fail silently.
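For reference, the failure mode described above is easy to reproduce in plain Python; this snippet only illustrates range()'s behavior and is not code from the repository:

```python
texts = ["a", "b", "c"]

# batch_size = 0: range() refuses a zero step, so the error surfaces deep
# inside the embedding loop instead of at provider initialization.
try:
    list(range(0, len(texts), 0))
except ValueError as exc:
    print(exc)  # range() arg 3 must not be zero

# batch_size = -1: no exception, but the loop silently yields zero batches.
print([texts[i : i + -1] for i in range(0, len(texts), -1)])  # []
```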
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2d933de8-c44b-4354-a147-41759106deb5
📒 Files selected for processing (4)
- wren-ai-service/docs/config_examples/config.qwen3.yaml
- wren-ai-service/src/pipelines/indexing/db_schema.py
- wren-ai-service/src/providers/embedder/litellm.py
- wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py
Fixes #2031
Problem
Users whose embedding providers enforce a maximum batch size (e.g. max 10 documents per API call) had no way to configure the embedding API batch size from `config.yaml`. The internal `AsyncDocumentEmbedder.batch_size` (which controls how many texts are sent in each call to `aembedding()`) defaulted to 32 and was not exposed by `LitellmEmbedderProvider`. Setting `column_indexing_batch_size` in the `settings` section has a different effect: it controls how many columns are grouped into a single DDL schema chunk, not how many documents are batched for the embedding API call. As a result, users received `400` errors from the embedding API.
Solution

Add `batch_size` as an explicit named parameter to `LitellmEmbedderProvider.__init__()` (defaulting to `32`) and forward it explicitly to `AsyncDocumentEmbedder` in `get_document_embedder()`. Users can now set it per-model in the `embedder` section of `config.yaml`:
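```yaml
type: embedder
provider: litellm_embedder
models:
  - model: openai/text-embedding-v4
    alias: default
    batch_size: 10  # cap API call size for providers with batch limits
```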
Also added a commented example in `config.qwen3.yaml` to document the option.

Testing

Verified that the embedder limits each `aembedding()` call to at most `batch_size` documents.
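A hedged sketch of how such a check could be written; `TinyEmbedder` and the spy function below are illustrative stand-ins, not the repository's actual test code:

```python
import asyncio


class TinyEmbedder:
    """Minimal embedder with the same batching behavior as the sketches above."""

    def __init__(self, batch_size: int, embed_fn):
        self._batch_size = batch_size
        self._embed_fn = embed_fn  # injected so the test can observe call sizes

    async def run(self, texts: list[str]) -> list[list[float]]:
        out: list[list[float]] = []
        for i in range(0, len(texts), self._batch_size):
            out.extend(await self._embed_fn(texts[i : i + self._batch_size]))
        return out


def test_each_api_call_respects_batch_size():
    call_sizes: list[int] = []

    async def spy_embed(batch: list[str]) -> list[list[float]]:
        call_sizes.append(len(batch))
        return [[0.0] * 3 for _ in batch]

    embedder = TinyEmbedder(batch_size=10, embed_fn=spy_embed)
    asyncio.run(embedder.run([f"doc {i}" for i in range(25)]))

    assert call_sizes == [10, 10, 5]  # 25 docs split into calls of at most 10
```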
Summary by CodeRabbit
Release Notes
New Features
Documentation