-
Couldn't load subscription status.
- Fork 149
feat: add support for automatic embeddings for the insert many tool MCP-236 #688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
9326ed4 to
3bf73da
Compare
Pull Request Test Coverage Report for Build 18912204743Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces support for automatic vector embeddings generation in the insert-many tool, allowing users to generate embeddings for fields with vector search indexes during document insertion. The implementation adds new functionality to generate embeddings using external providers like Voyage AI while maintaining backward compatibility.
Key changes:
- Enhanced the
insert-manytool to accept optional embedding parameters for automatic embedding generation - Improved vector search validation with dedicated assertion methods for better error handling
- Added comprehensive test coverage including integration tests with real embedding generation and accuracy tests
Reviewed Changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/tools/mongodb/create/insertMany.ts |
Core implementation of embedding generation logic with validation and document processing |
src/common/search/vectorSearchEmbeddingsManager.ts |
Refactored validation methods and added assertion utilities for vector search operations |
src/tools/mongodb/read/aggregate.ts |
Updated to use new assertion method for vector search index validation |
tests/integration/tools/mongodb/create/insertMany.test.ts |
Comprehensive integration tests for embedding generation scenarios |
tests/unit/common/search/vectorSearchEmbeddingsManager.test.ts |
Unit tests for new assertion functionality |
tests/accuracy/insertMany.embeddings.test.ts |
Accuracy tests to validate tool behavior with different AI models |
tests/accuracy/sdk/matcher.ts |
Added null matcher utility for test assertions |
src/common/errors.ts |
Added new error code for unexpected errors |
CONTRIBUTING.md |
Documentation update for colima Docker configuration |
| await this.session.vectorSearchEmbeddingsManager.assertFieldsHaveCorrectEmbeddings( | ||
| { database, collection }, | ||
| documents | ||
| ); |
Copilot
AI
Oct 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This validation is performed after potentially expensive embedding generation. Consider moving this validation before the embedding generation step to fail fast and avoid unnecessary API calls when documents have invalid embeddings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well the embeddings generation (when it happens) is relevant for asserting if the embeddings are correct, mr robot
| beforeEach(async () => { | ||
| await integration.connectMcpClient(); | ||
| collection = await integration.mongoClient().db(integration.randomDbName()).createCollection("test"); | ||
| database = integration.randomDbName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I regret doing this because there's too many calls of this function but I'd not have expected integration.randomDbName() to always return the same value so switched to storing in a variable
| npm test -- path/to/directory | ||
| ``` | ||
|
|
||
| #### Accuracy Tests and colima |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this is considered niche but I find colima to be a nice simple way to run docker on Mac and if this existed I'd not have lost many painful moments of figuring this bit out
| protected description = "Insert an array of documents into a MongoDB collection"; | ||
| protected argsShape = { | ||
| ...DbOperationArgs, | ||
| documents: z |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay so I did my best to prompt engineer it to generate embeddings by supplying plain text to the vector search indexed fields as strings inside documents.
However, in my experience models get confused about this idea of "the schema says it's a vector but I need to ignore this and supply a string" and I also think relying on the description of the documents field for this is less scalable.
So, the best way I found to force the model to provide these vector index raw strings is to put it as a required input as part of embeddingParameters. It seems to do a good job at mapping the 2 arrays and respective indexes. It does sometimes provide a string in place of the original document's field in the input, but this gets overriden so seems quite fine.
Open to suggestions if I may be overlooking something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should work fine, we have the accuracy tests if something goes wary in the future and we can safely tweak the prompt.
|
I'll look into adding a few more tests I guess to keep coveralls happy |
| await this.session.vectorSearchEmbeddingsManager.assertFieldsHaveCorrectEmbeddings( | ||
| { database, collection }, | ||
| documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[q] do we need to run this if embeddings are not enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was existing behavior so I kept it as we would want to prevent the model from adding artbirary data to a vector search indexed field.
that said, I'm wondering if the insert many call would have failed at DB-level anyhow; not sure if that's redundant then.
cc: @kmruiz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So @gagik is completely right. It's to prevent the agent adding a raw field in a place where the server expects an embedding. Currently, the server does not reject these invalid values (unless the user specifies a JSON Schema) and it would break the VS Index and the behaviour can be pretty inconsistent depending on how it breaks.
| embeddingParameters: SupportedEmbeddingParameters; | ||
| inputType: EmbeddingParameters["inputType"]; | ||
| }): Promise<unknown[]> { | ||
| }): Promise<number[][]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, this is not correct. An Embedding can be also other types, like object[] (for different types of quantisations). That's why it was an unknown[].
Adds automatic generation using the Voyage API key, with a system to minimize requests being made in bulk.