spumer commented Oct 7, 2025

Hi. I've been testing a few models and providers for my projects and made some contributions to the claude-context repo.

Maybe these features will be useful for your projects too.

Thanks for your work!

Below is the description prepared by Claude Code:

🎯 Summary

Add configurable collection naming strategy to prevent data conflicts when switching between different embedding providers and models.

💡 Motivation

Currently, when users switch between embedding providers (e.g., from Ollama to LlamaCpp), all providers share the same collection name (hybrid_code_chunks_<hash>). This causes:

  • Data contamination: New embeddings overwrite previous ones
  • Invalid comparisons: Embeddings from different models end up mixed in the same collection
  • Lost work: Previous indexing is lost when switching providers

✨ What's New

Core Features

  1. Strict Collection Naming Mode (opt-in via EMBEDDING_STRICT_COLLECTION_NAMES=true)

    • Collections include provider and model: hybrid_<provider>_<model>_<path_hash>_<unique_hash>
    • Example: hybrid_ollama_nomic_embed_text_abc12345_def67890
    • Complete isolation between providers
  2. Custom Collection Names (via MILVUS_COLLECTION_NAME)

    • Override automatic naming with custom collection name
    • Useful for advanced use cases and testing
  3. New getModel() Method

    • Added to base Embedding class
    • Implemented in all providers (Ollama, LlamaCpp, OpenAI, Gemini, VoyageAI)
    • Enables provider+model identification
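
The naming rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `CollectionNameOptions` shape, the `sanitize` helper, and the precedence (custom name first, then strict, then legacy) are assumptions, and both hashes are passed in rather than computed.

```typescript
// Hypothetical option shape; field names are illustrative only.
interface CollectionNameOptions {
  provider: string;    // e.g. "Ollama"
  model: string;       // e.g. "nomic-embed-text"
  pathHash: string;    // short hash of the codebase path (assumed input)
  uniqueHash: string;  // short disambiguating hash (assumed input)
  strict: boolean;     // mirrors EMBEDDING_STRICT_COLLECTION_NAMES=true
  customName?: string; // mirrors MILVUS_COLLECTION_NAME
}

// Replace every character outside [A-Za-z0-9_] with an underscore.
function sanitize(part: string): string {
  return part.replace(/[^A-Za-z0-9_]/g, "_");
}

function buildCollectionName(opts: CollectionNameOptions): string {
  // Assumption: an explicit custom name overrides both automatic modes.
  if (opts.customName && opts.customName.trim() !== "") {
    return opts.customName.trim();
  }
  if (opts.strict) {
    // Strict mode: provider and model become part of the name.
    const provider = sanitize(opts.provider.toLowerCase());
    const model = sanitize(opts.model);
    return `hybrid_${provider}_${model}_${opts.pathHash}_${opts.uniqueHash}`;
  }
  // Legacy mode: one name per codebase, shared by all providers.
  return `hybrid_code_chunks_${opts.pathHash}`;
}
```

With `provider: "Ollama"`, `model: "nomic-embed-text"`, and the hashes `abc12345` / `def67890`, strict mode yields the example name shown in item 1, while legacy mode ignores provider and model entirely.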

Backward Compatibility

  • Default behavior unchanged: Legacy naming (hybrid_code_chunks_<hash>) remains default
  • Opt-in feature: Users must explicitly enable strict naming
  • No breaking changes: Existing collections continue to work

🎁 Benefits

For Users

  • Safe experimentation: Test multiple providers without data loss
  • Quality comparison: Compare embedding quality accurately
  • Flexibility: Choose naming strategy that fits workflow
  • Transparency: Debug output shows active configuration

For Project

  • No breaking changes: Fully backward compatible
  • Well-tested: Comprehensive test coverage
  • Well-documented: Complete docs for users
  • Extensible: Easy to add more providers

spumer and others added 4 commits September 21, 2025 00:09
- Add LlamaCppEmbedding class with OpenAI-compatible API
- Support for local llama.cpp servers with nomic-embed-code model
- Automatic code prefix for improved code search quality
- Configurable timeout and dimension auto-detection
- Integration with MCP configuration system
- Environment variables: LLAMACPP_HOST, LLAMACPP_MODEL, LLAMACPP_TIMEOUT, LLAMACPP_CODE_PREFIX
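
As an illustration of how these four variables might be read, here is a small sketch. The `LlamaCppConfig` shape, the `readLlamaCppConfig` helper, and every default value (host, model, timeout, prefix enabled) are assumptions made for the example; the commit does not state them. Treating `LLAMACPP_CODE_PREFIX` as an on/off toggle is also an assumption.

```typescript
// Hypothetical config shape for a llama.cpp embedding client.
interface LlamaCppConfig {
  host: string;
  model: string;
  timeoutMs: number;
  codePrefix: boolean;
}

type Env = Record<string, string | undefined>;

function readLlamaCppConfig(env: Env): LlamaCppConfig {
  const timeout = Number(env["LLAMACPP_TIMEOUT"]);
  return {
    // Assumed default: a llama.cpp server listening locally on port 8080.
    host: env["LLAMACPP_HOST"] || "http://localhost:8080",
    // Assumed default model name.
    model: env["LLAMACPP_MODEL"] || "nomic-embed-code",
    // Fall back to an assumed 30 s timeout when unset or not a positive number.
    timeoutMs: isFinite(timeout) && timeout > 0 ? timeout : 30000,
    // Assumed semantics: the prefix is on unless explicitly set to "false".
    codePrefix: (env["LLAMACPP_CODE_PREFIX"] ?? "true").toLowerCase() !== "false",
  };
}
```

Passing the environment in as a plain object (for example `readLlamaCppConfig(process.env)` under Node.js) keeps the parsing logic easy to test in isolation.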

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Add LlamaCpp to supported embedding providers in main README
- Update environment variables documentation with LlamaCpp options
- Add comprehensive LlamaCpp configuration guide to MCP README
- Include setup instructions for local inference on consumer hardware
- Add configuration examples for various MCP clients
- Document LlamaCpp's goal: enable large model inference on Apple Silicon and desktop GPUs

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
Add configurable collection naming strategy to prevent conflicts between
different embedding providers and models. This ensures complete isolation
when switching between providers like Ollama and LlamaCpp.

## Core Changes

### Embedding Providers
- Add abstract `getModel()` method to base Embedding class
- Implement `getModel()` in all providers:
  - OllamaEmbedding: returns config.model
  - LlamaCppEmbedding: returns config.model with fallback
  - OpenAIEmbedding: returns config.model
  - GeminiEmbedding: returns config.model with fallback
  - VoyageAIEmbedding: returns config.model
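
A stripped-down sketch of the pattern described above, covering two of the providers. The class bodies, the `getProvider()` method, the `embeddingId` helper, and the fallback model name are illustrative assumptions; the real classes carry far more (clients, dimensions, embedding calls).

```typescript
// Simplified stand-in for the base class; only the identification methods are shown.
abstract class Embedding {
  abstract getProvider(): string; // assumed to exist alongside getModel()
  abstract getModel(): string;
}

class OllamaEmbedding extends Embedding {
  private readonly config: { model: string };
  constructor(config: { model: string }) {
    super();
    this.config = config;
  }
  getProvider(): string {
    return "Ollama";
  }
  getModel(): string {
    return this.config.model;
  }
}

class LlamaCppEmbedding extends Embedding {
  private readonly config: { model?: string };
  constructor(config: { model?: string }) {
    super();
    this.config = config;
  }
  getProvider(): string {
    return "LlamaCpp";
  }
  getModel(): string {
    // Hypothetical fallback value used when no model is configured.
    return this.config.model || "nomic-embed-code";
  }
}

// Hypothetical helper: code that only sees the base type can still
// identify the provider+model pair.
function embeddingId(embedding: Embedding): string {
  return `${embedding.getProvider()}/${embedding.getModel()}`;
}
```

Making `getModel()` abstract means a newly added provider will not compile until it states which model it uses, which is what lets the collection name depend on it.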

### Collection Naming
- Add `EMBEDDING_STRICT_COLLECTION_NAMES` environment variable
- Implement dual naming strategies in Context.getCollectionName():
  - Legacy (default): `hybrid_code_chunks_<hash>` (backward compatible)
  - Strict: `hybrid_<provider>_<model>_<path_hash>_<unique_hash>`
- Add `customCollectionName` support in ContextConfig
- Ensure model names are sanitized for safe collection naming

### MCP Integration
- Add `embeddingStrictCollectionNames` to ContextMcpConfig
- Auto-set environment variable from MCP config
- Add new variables to debug output:
  - MILVUS_TOKEN (shows length only for security)
  - MILVUS_COLLECTION_NAME
  - LLAMACPP_TIMEOUT
  - LLAMACPP_CODE_PREFIX
  - EMBEDDING_STRICT_COLLECTION_NAMES
- Update help message with new configuration options
- Add examples for strict collection naming usage
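
The "shows length only" behavior for `MILVUS_TOKEN` could look something like the sketch below. The `formatDebugLines` helper and the exact output strings are assumptions for illustration, not the PR's actual debug format.

```typescript
type EnvMap = Record<string, string | undefined>;

// Variables that are safe to print verbatim.
const PLAIN_DEBUG_VARS = [
  "MILVUS_COLLECTION_NAME",
  "LLAMACPP_TIMEOUT",
  "LLAMACPP_CODE_PREFIX",
  "EMBEDDING_STRICT_COLLECTION_NAMES",
];

function formatDebugLines(env: EnvMap): string[] {
  const token = env["MILVUS_TOKEN"];
  // Never print the secret itself; report only whether it is set and how long it is.
  const tokenInfo = token ? `[set, length ${token.length}]` : "[not set]";
  const lines = [`MILVUS_TOKEN: ${tokenInfo}`];
  for (const name of PLAIN_DEBUG_VARS) {
    lines.push(`${name}: ${env[name] ?? "[not set]"}`);
  }
  return lines;
}
```

Printing the length rather than a masked prefix avoids leaking any part of the token while still revealing an empty or truncated value.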

### Documentation
- Update .env.example with collection naming configuration
- Add comprehensive examples in MCP help text
- Document all new environment variables

## Benefits
- **Zero conflict risk**: Each provider+model combination gets unique collection
- **Safe experimentation**: Switch providers without data contamination
- **Backward compatible**: Legacy naming works by default
- **Full isolation**: Ollama and LlamaCpp collections never intersect

## Example Collection Names
- Ollama: `hybrid_ollama_nomic_embed_text_abc12345_def67890`
- LlamaCpp: `hybrid_llamacpp_nomic_embed_code_Q4_1_gguf_abc12345_fed09876`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Document new EMBEDDING_STRICT_COLLECTION_NAMES and MILVUS_COLLECTION_NAME
environment variables in all relevant documentation files.

## Updated Documentation

### Environment Variables Guide
- Add MILVUS_COLLECTION_NAME variable description
- Add EMBEDDING_STRICT_COLLECTION_NAMES variable with detailed explanation
- Add collection naming modes comparison (legacy vs strict)
- Add use cases and benefits for strict mode

### MCP README
- Add new "Collection Naming Configuration" section
- Document both naming modes with examples
- Explain format differences: `hybrid_code_chunks_<hash>` vs `hybrid_<provider>_<model>_<path_hash>_<unique_hash>`
- Recommend strict mode for multi-provider experimentation

## Benefits for Users
- Clear understanding of collection naming behavior
- Guidance on when to use strict mode
- Prevention of data conflicts when switching providers
- Complete reference for all configuration options

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>