Skip to content

Conversation

@gagik
Copy link
Collaborator

@gagik gagik commented Oct 24, 2025

Adds automatic generation using the Voyage API key, with a system to minimize requests being made in bulk.

@gagik gagik force-pushed the gagik/accuracy-embeddings branch from 9326ed4 to 3bf73da Compare October 27, 2025 12:50
@coveralls
Copy link
Collaborator

coveralls commented Oct 29, 2025

Pull Request Test Coverage Report for Build 18912204743

Details

  • 92 of 151 (60.93%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 79.885%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/tools/mongodb/read/aggregate.ts 4 14 28.57%
src/tools/mongodb/create/insertMany.ts 45 94 47.87%
Totals Coverage Status
Change from base Build 18909867313: -0.3%
Covered Lines: 6412
Relevant Lines: 7890

💛 - Coveralls

@gagik gagik marked this pull request as ready for review October 29, 2025 14:38
@gagik gagik requested a review from a team as a code owner October 29, 2025 14:38
Copilot AI review requested due to automatic review settings October 29, 2025 14:38
@gagik gagik changed the title WIP - insert many with generated embeddings feat: add support for automatic embeddings for the insert many tool Oct 29, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces support for automatic vector embeddings generation in the insert-many tool, allowing users to generate embeddings for fields with vector search indexes during document insertion. The implementation adds new functionality to generate embeddings using external providers like Voyage AI while maintaining backward compatibility.

Key changes:

  • Enhanced the insert-many tool to accept optional embedding parameters for automatic embedding generation
  • Improved vector search validation with dedicated assertion methods for better error handling
  • Added comprehensive test coverage including integration tests with real embedding generation and accuracy tests

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/tools/mongodb/create/insertMany.ts Core implementation of embedding generation logic with validation and document processing
src/common/search/vectorSearchEmbeddingsManager.ts Refactored validation methods and added assertion utilities for vector search operations
src/tools/mongodb/read/aggregate.ts Updated to use new assertion method for vector search index validation
tests/integration/tools/mongodb/create/insertMany.test.ts Comprehensive integration tests for embedding generation scenarios
tests/unit/common/search/vectorSearchEmbeddingsManager.test.ts Unit tests for new assertion functionality
tests/accuracy/insertMany.embeddings.test.ts Accuracy tests to validate tool behavior with different AI models
tests/accuracy/sdk/matcher.ts Added null matcher utility for test assertions
src/common/errors.ts Added new error code for unexpected errors
CONTRIBUTING.md Documentation update for colima Docker configuration

Comment on lines +52 to +55
await this.session.vectorSearchEmbeddingsManager.assertFieldsHaveCorrectEmbeddings(
{ database, collection },
documents
);
Copy link

Copilot AI Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation is performed after potentially expensive embedding generation. Consider moving this validation before the embedding generation step to fail fast and avoid unnecessary API calls when documents have invalid embeddings.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

@gagik gagik Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well the embeddings generation (when it happens) is relevant for asserting if the embeddings are correct, mr robot

beforeEach(async () => {
await integration.connectMcpClient();
collection = await integration.mongoClient().db(integration.randomDbName()).createCollection("test");
database = integration.randomDbName();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I regret doing this because there's too many calls of this function but I'd not have expected integration.randomDbName() to always return the same value so switched to storing in a variable

npm test -- path/to/directory
```

#### Accuracy Tests and colima
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is considered niche but I find colima to be a nice simple way to run docker on Mac and if this existed I'd not have lost many painful moments of figuring this bit out

protected description = "Insert an array of documents into a MongoDB collection";
protected argsShape = {
...DbOperationArgs,
documents: z
Copy link
Collaborator Author

@gagik gagik Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay so I did my best to prompt engineer it to generate embeddings by supplying plain text to the vector search indexed fields as strings inside documents.
However, in my experience models get confused about this idea of "the schema says it's a vector but I need to ignore this and supply a string" and I also think relying on the description of the documents field for this is less scalable.
So, the best way I found to force the model to provide these vector index raw strings is to put it as a required input as part of embeddingParameters. It seems to do a good job at mapping the 2 arrays and respective indexes. It does sometimes provide a string in place of the original document's field in the input, but this gets overriden so seems quite fine.

Open to suggestions if I may be overlooking something here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work fine, we have the accuracy tests if something goes wary in the future and we can safely tweak the prompt.

@gagik gagik changed the title feat: add support for automatic embeddings for the insert many tool feat: add support for automatic embeddings for the insert many tool MCP-236 Oct 29, 2025
@gagik
Copy link
Collaborator Author

gagik commented Oct 29, 2025

I'll look into adding a few more tests I guess to keep coveralls happy

Comment on lines +60 to +62
await this.session.vectorSearchEmbeddingsManager.assertFieldsHaveCorrectEmbeddings(
{ database, collection },
documents
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[q] do we need to run this if embeddings are not enabled?

Copy link
Collaborator Author

@gagik gagik Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was existing behavior so I kept it as we would want to prevent the model from adding artbirary data to a vector search indexed field.
that said, I'm wondering if the insert many call would have failed at DB-level anyhow; not sure if that's redundant then.
cc: @kmruiz

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @gagik is completely right. It's to prevent the agent adding a raw field in a place where the server expects an embedding. Currently, the server does not reject these invalid values (unless the user specifies a JSON Schema) and it would break the VS Index and the behaviour can be pretty inconsistent depending on how it breaks.

embeddingParameters: SupportedEmbeddingParameters;
inputType: EmbeddingParameters["inputType"];
}): Promise<unknown[]> {
}): Promise<number[][]> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, this is not correct. An Embedding can be also other types, like object[] (for different types of quantisations). That's why it was an unknown[].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants