
(ChromaDB Plugin) Convert current ChromaDB Implementation to be a Plugin #229

@NoveliaYuki

Knowledge Store Plugin Architecture - Implementation Plan

Version: 4.0 (Technology-Agnostic API)
Status: Draft for Review
Prerequisite: Issue #203 (KnowledgeStore Setups) must be implemented first


Executive Summary

This proposal extends the KnowledgeStore Setup infrastructure from Issue #203 to enable technology-agnostic retrieval backends for KB-Server. By expanding organization-level setup management to support multiple plugin types, we achieve:

  1. Technology-agnostic retrieval - Same API for vector DBs, graph DBs, keyword search, hybrid systems
  2. Per-collection backend selection - Different collections can use different technologies
  3. Seamless backend migration - Change backends without recreating collections
  4. Simplified service layer - Services work with chunks + metadata only, never touching embeddings/graphs/indexes
  5. Full backward compatibility - ChromaDB remains default, all existing functionality preserved

Relationship to Issue #203

Issue #203 provides:

  • Organization-level setup management
  • Collections reference setups (not inline configs)
  • API key rotation capability
  • Setup reusability across collections

This proposal extends #203 by:

  • Renaming embeddings_setups → knowledge_store_setups
  • Adding plugin_type field to specify backend technology
  • Making plugin_config generic JSON (not embedding-specific)
  • Implementing plugin architecture that reads from setups
  • Enabling multi-backend support (vector, graph, keyword, hybrid)

Architecture Overview

Current State (After Issue #203)

Organizations
    ↓ 1:N
Embeddings Setups (embeddings-specific)
    ↓ 1:N
Collections
    ↓
ChromaDB (hardcoded, ONLY backend)

Problems:

  • ChromaDB-only (can't support Neo4j, ElasticSearch, etc.)
  • embeddings_setups name implies vector-only
  • Setup config is embeddings-specific
  • Services directly call ChromaDB APIs

Target State (Plugin Architecture)

Organizations
    ↓ 1:N
KnowledgeStore Setups (renamed & expanded)
  Setup 1: ChromaDB + OpenAI (plugin_type: "chromadb")
  Setup 2: Neo4j Graph (plugin_type: "neo4j")
  Setup 3: ElasticSearch (plugin_type: "elasticsearch")
    ↓ 1:N
Collections (unchanged - still reference setup_id)
    ↓
Service Layer (technology-agnostic: chunks + metadata only)
    ↓
KnowledgeStorePlugin Interface (abstract)
    ↓
┌──────────┬──────────┬──────────┬──────────┐
│ ChromaDB │  Qdrant  │  Neo4j   │ElasticSch│
│ Plugin   │  Plugin  │  Plugin  │  Plugin  │
└──────────┴──────────┴──────────┴──────────┘
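
The dispatch step in the diagram (plugin_type string to plugin class) relies on a PluginRegistry that this issue references but does not define. A minimal sketch of what it could look like follows; the method names `get` and `available`, the duplicate check, and the trimmed stand-in base class are all assumptions made for illustration, not the existing KB-Server code.

```python
import abc
from typing import Dict, List, Type


class KnowledgeStorePlugin(abc.ABC):
    """Trimmed stand-in for the interface proposed in this issue."""

    name: str = "plugin_name"


class PluginRegistry:
    """Hypothetical registry mapping plugin_type strings to plugin classes."""

    _plugins: Dict[str, Type[KnowledgeStorePlugin]] = {}

    @classmethod
    def register(cls, plugin_cls: Type[KnowledgeStorePlugin]):
        # Used as a class decorator; the plugin's `name` is the lookup key.
        if plugin_cls.name in cls._plugins:
            raise ValueError(f"duplicate plugin_type: {plugin_cls.name}")
        cls._plugins[plugin_cls.name] = plugin_cls
        return plugin_cls

    @classmethod
    def get(cls, plugin_type: str) -> Type[KnowledgeStorePlugin]:
        # Resolve the plugin_type stored on a setup to a plugin class.
        try:
            return cls._plugins[plugin_type]
        except KeyError:
            raise LookupError(f"unknown plugin_type: {plugin_type!r}") from None

    @classmethod
    def available(cls) -> List[str]:
        # Sorted so API responses built from it are stable.
        return sorted(cls._plugins)


@PluginRegistry.register
class ChromaDBPlugin(KnowledgeStorePlugin):
    name = "chromadb"


@PluginRegistry.register
class Neo4jPlugin(KnowledgeStorePlugin):
    name = "neo4j"
```

With something like this in place, a setup whose plugin_type is "neo4j" would resolve through `PluginRegistry.get("neo4j")`, and an unregistered type fails with a clear error instead of silently falling back to ChromaDB.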

Key Concept: Enhanced KnowledgeStore Setup

Before (Issue #203):

{
  "name": "OpenAI Production",
  "setup_key": "openai-prod",
  "vendor": "openai",
  "model_name": "text-embedding-3-small",
  "api_key": "sk-...",
  "api_endpoint": "https://api.openai.com/v1/embeddings",
  "embedding_dimensions": 1536
}

After (This Proposal):

{
  "name": "ChromaDB with OpenAI",
  "setup_key": "chromadb-openai-prod",
  "plugin_type": "chromadb",  // NEW: Specifies plugin
  "plugin_config": {          // NEW: Generic JSON config
    "vendor": "openai",
    "model": "text-embedding-3-small",
    "api_key": "sk-...",
    "api_endpoint": "https://api.openai.com/v1/embeddings",
    "embedding_dimensions": 1536
  }
}

Other Plugin Examples:

// Neo4j (graph-based, no embeddings)
{
  "plugin_type": "neo4j",
  "plugin_config": {
    "uri": "bolt://localhost:7687",
    "user": "neo4j",
    "password": "secret",
    "schema": {...}
  }
}

// ElasticSearch (keyword search)
{
  "plugin_type": "elasticsearch",
  "plugin_config": {
    "hosts": ["localhost:9200"],
    "api_key": "...",
    "analyzer": "english"
  }
}

Database Schema Changes

Assumptions from Issue #203

  • organizations table exists
  • embeddings_setups table exists
  • collections.organization_id exists
  • collections.embeddings_setup_id exists

Required Migrations

-- 1. Rename table
ALTER TABLE embeddings_setups RENAME TO knowledge_store_setups;

-- 2. Add plugin_type field
ALTER TABLE knowledge_store_setups
ADD COLUMN plugin_type TEXT NOT NULL DEFAULT 'chromadb';

-- 3. Add plugin_config JSON field
ALTER TABLE knowledge_store_setups
ADD COLUMN plugin_config JSON;

-- 4. Migrate existing data into plugin_config
UPDATE knowledge_store_setups
SET plugin_config = json_object(
    'vendor', vendor,
    'api_endpoint', api_endpoint,
    'api_key', api_key,
    'model', model_name,
    'embedding_dimensions', embedding_dimensions
)
WHERE plugin_type = 'chromadb';

-- 5. Add new column to collections
ALTER TABLE collections
ADD COLUMN knowledge_store_setup_id INTEGER
REFERENCES knowledge_store_setups(id);

-- 6. Copy data
UPDATE collections
SET knowledge_store_setup_id = embeddings_setup_id;

-- 7. Create indexes
CREATE INDEX idx_collections_knowledge_store_setup
    ON collections(knowledge_store_setup_id);
CREATE INDEX idx_knowledge_store_setups_org_plugin
    ON knowledge_store_setups(organization_id, plugin_type);
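
Migrations 1 to 4 can be dry-run against a throwaway database before touching real data. The sketch below assumes the database is SQLite (the `json_object` call suggests it, but the issue does not say so) and invents a minimal `embeddings_setups` schema with one sample row purely for the test; the real table from #203 may have more columns.

```python
import json
import sqlite3

# In-memory database so the dry run leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed pre-migration schema and sample row (illustrative only).
    CREATE TABLE embeddings_setups (
        id INTEGER PRIMARY KEY,
        organization_id INTEGER,
        vendor TEXT,
        api_endpoint TEXT,
        api_key TEXT,
        model_name TEXT,
        embedding_dimensions INTEGER
    );
    INSERT INTO embeddings_setups VALUES
        (1, 10, 'openai', 'https://api.openai.com/v1/embeddings',
         'sk-test', 'text-embedding-3-small', 1536);

    -- Migrations 1-4 from this proposal.
    ALTER TABLE embeddings_setups RENAME TO knowledge_store_setups;
    ALTER TABLE knowledge_store_setups
        ADD COLUMN plugin_type TEXT NOT NULL DEFAULT 'chromadb';
    ALTER TABLE knowledge_store_setups ADD COLUMN plugin_config JSON;
    UPDATE knowledge_store_setups
    SET plugin_config = json_object(
        'vendor', vendor,
        'api_endpoint', api_endpoint,
        'api_key', api_key,
        'model', model_name,
        'embedding_dimensions', embedding_dimensions
    )
    WHERE plugin_type = 'chromadb';
    """
)

# Read the migrated row back and decode the generated JSON.
plugin_type, raw_config = conn.execute(
    "SELECT plugin_type, plugin_config FROM knowledge_store_setups WHERE id = 1"
).fetchone()
config = json.loads(raw_config)
```

After the run, `plugin_type` holds the default and `config` holds the five legacy fields, with `model_name` stored under the new `model` key.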

API Changes

Setup Endpoints (Extended from #203)

List Setups - Now includes plugin_type:

GET /organizations/{org_id}/knowledge-store-setups

Response:
{
  "setups": [
    {
      "id": 1,
      "name": "ChromaDB with OpenAI",
      "setup_key": "chromadb-openai-prod",
      "plugin_type": "chromadb",  // NEW
      "plugin_config": {...},      // NEW
      "collections_count": 42
    }
  ]
}

Create Setup - Requires plugin_type:

POST /organizations/{org_id}/knowledge-store-setups

{
  "name": "Neo4j Knowledge Graph",
  "setup_key": "neo4j-prod",
  "plugin_type": "neo4j",    // NEW: Required
  "plugin_config": {         // NEW: Plugin-specific
    "uri": "bolt://localhost:7687",
    "user": "neo4j",
    "password": "secret"
  }
}

New Endpoints

List Available Plugin Types:

GET /knowledge-store-plugins

Response:
{
  "plugins": [
    {
      "plugin_type": "chromadb",
      "name": "ChromaDB",
      "description": "Vector database with persistent storage",
      "supports_embeddings": true,
      "supports_metadata": true
    },
    {
      "plugin_type": "neo4j",
      "name": "Neo4j",
      "description": "Graph database for knowledge graphs",
      "supports_embeddings": false,
      "supports_metadata": true
    }
  ]
}

Validate Plugin Config:

POST /knowledge-store-plugins/{plugin_type}/validate-config

{
  "plugin_config": {
    "uri": "bolt://localhost:7687",
    "user": "neo4j",
    "password": "test"
  }
}

Response:
{
  "valid": true,
  "errors": [],
  "warnings": []
}
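
The response shape above (valid, errors, warnings) maps naturally onto a small pure function per plugin. As a rough sketch of what a Neo4j validator might check: the required-key list mirrors the example request, while the localhost warning and the function name `validate_neo4j_config` are assumptions added for illustration.

```python
from typing import Any, Dict, List

# Keys taken from the example request in this issue.
REQUIRED_NEO4J_KEYS = ("uri", "user", "password")

# URI schemes accepted by the official Neo4j drivers.
NEO4J_URI_SCHEMES = (
    "bolt://", "bolt+s://", "bolt+ssc://",
    "neo4j://", "neo4j+s://", "neo4j+ssc://",
)


def validate_neo4j_config(plugin_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the {valid, errors, warnings} shape used by validate-config."""
    errors: List[str] = []
    warnings: List[str] = []

    # Missing or empty required fields are hard errors.
    for key in REQUIRED_NEO4J_KEYS:
        if not plugin_config.get(key):
            errors.append(f"missing required field: {key}")

    uri = str(plugin_config.get("uri") or "")
    if uri and not uri.startswith(NEO4J_URI_SCHEMES):
        errors.append(f"unsupported URI scheme: {uri}")

    # Assumed policy: localhost is allowed but flagged.
    if "localhost" in uri:
        warnings.append("uri points at localhost")

    return {"valid": not errors, "errors": errors, "warnings": warnings}


result = validate_neo4j_config(
    {"uri": "bolt://localhost:7687", "user": "neo4j", "password": "test"}
)
```

Keeping validation free of network calls makes the endpoint cheap to call from a setup form; a separate connectivity check could be layered on top if the team wants one.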

Collection Creation (Unchanged from #203!)

POST /collections

{
  "name": "Research Papers",
  "organization_external_id": "org_123",
  "knowledge_store_setup_key": "neo4j-prod"  // Same as #203
}

Key Point: Collection API unchanged - backend selection happens via setup choice.


Plugin Architecture

KnowledgeStorePlugin Interface

import abc
from typing import Any, Dict, List, Optional


class KnowledgeStorePlugin(abc.ABC):
    """Technology-agnostic interface for retrieval backends.

    Plugins work with CHUNKS and METADATA only.
    No embeddings, vectors, or graphs in the interface!
    """

    name: str = "plugin_name"
    supports_embeddings: bool = False
    supports_metadata: bool = True

    # Initialization
    @abc.abstractmethod
    def initialize(self, global_config: Dict) -> None:
        """Initialize plugin (e.g., ChromaDB storage path)."""
        pass

    @classmethod
    @abc.abstractmethod
    def validate_plugin_config(cls, plugin_config: Dict) -> Dict:
        """Validate setup's plugin_config."""
        pass

    # Collection Operations
    @abc.abstractmethod
    def create_collection(self, name: str, setup_id: int,
                         metadata: Optional[Dict] = None) -> Any:
        """Create collection. Plugin resolves setup_id to get config."""
        pass

    @abc.abstractmethod
    def get_collection(self, name: str) -> Any:
        """Return an existing collection (used by the service layer)."""
        pass

    # Chunk Operations (technology-agnostic!)
    @abc.abstractmethod
    def add_chunks(self, collection, chunk_ids: List[str],
                   chunk_texts: List[str],
                   chunk_metadata: Optional[List[Dict]] = None) -> None:
        """Add chunks. Plugin decides: embeddings? graph? index?"""
        pass

    @abc.abstractmethod
    def query_chunks(self, collection, query_text: str,
                    n_results: int = 10,
                    filters: Optional[Dict] = None) -> Dict:
        """Query chunks. Plugin decides how to process query."""
        pass

    # Helper to resolve setup
    def _resolve_setup(self, setup_id: int) -> Dict:
        """Get plugin_config from setup (enables key rotation!)."""
        # `db` is the KB-Server database access layer, assumed to be in scope
        setup = db.get_knowledge_store_setup(setup_id)
        return setup.plugin_config

ChromaDB Plugin Implementation

@PluginRegistry.register
class ChromaDBPlugin(KnowledgeStorePlugin):
    name = "chromadb"
    supports_embeddings = True

    def create_collection(self, name, setup_id, metadata=None):
        # Resolve the setup up front so an unknown setup_id fails here,
        # not on the first ingest
        self._resolve_setup(setup_id)

        # Store setup_id in collection metadata (for later resolution),
        # keeping any metadata supplied by the caller
        collection_metadata = {
            **(metadata or {}),
            "knowledge_store_setup_id": setup_id,
            "hnsw:space": "cosine"
        }

        return self._client.create_collection(
            name=name,
            metadata=collection_metadata
        )

    def add_chunks(self, collection, chunk_ids, chunk_texts, chunk_metadata):
        # Get embedding function from setup (INTERNAL to plugin!)
        embedding_func = self._get_embedding_function(collection)

        # Compute embeddings (service layer never touches this!)
        embeddings = embedding_func(chunk_texts)

        # Store chunks with embeddings
        collection.add(
            ids=chunk_ids,
            documents=chunk_texts,
            embeddings=embeddings,
            metadatas=chunk_metadata
        )

    def _get_embedding_function(self, collection):
        """Get embedding function from setup config.

        KEY CHANGE: Resolves setup each time (allows key rotation!)
        """
        setup_id = collection.metadata["knowledge_store_setup_id"]
        plugin_config = self._resolve_setup(setup_id)

        return get_embedding_function_by_params(
            vendor=plugin_config.get("vendor"),
            model_name=plugin_config.get("model"),
            api_key=plugin_config.get("api_key"),  # Gets CURRENT key!
            api_endpoint=plugin_config.get("api_endpoint")
        )
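
The ChromaDB plugin above shows add_chunks but not the matching query_chunks. One possible shape for it is sketched below as a standalone function so it can be exercised without a Chroma server: `collection` is expected to behave like a Chroma collection (its `query` method takes `query_embeddings`, `n_results` and `where`), while `embed` stands in for the function returned by `_get_embedding_function`. The flattened return dictionary is an assumption; the issue does not specify the result format.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def query_chunks_chroma(
    collection: Any,
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
    query_text: str,
    n_results: int = 10,
    filters: Optional[Dict] = None,
) -> Dict[str, List]:
    """Embed the query inside the plugin, then run a vector search."""
    # The service layer only ever passes text; embedding happens here.
    query_embeddings = embed([query_text])

    raw = collection.query(
        query_embeddings=query_embeddings,
        n_results=n_results,
        where=filters or None,  # treat an empty dict as "no filter"
    )

    # Chroma returns one inner list per query; exactly one query was sent.
    return {
        "ids": raw["ids"][0],
        "texts": raw["documents"][0],
        "metadata": raw["metadatas"][0],
        "distances": raw["distances"][0],
    }
```

Inside the plugin class this would become `query_chunks(self, collection, ...)`, with `embed` obtained from `self._get_embedding_function(collection)` so that queries also pick up rotated keys.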

Neo4j Plugin (Example - No Embeddings!)

@PluginRegistry.register
class Neo4jPlugin(KnowledgeStorePlugin):
    name = "neo4j"
    supports_embeddings = False  # Graph-based, no embeddings!

    def add_chunks(self, collection, chunk_ids, chunk_texts, chunk_metadata=None):
        # Extract entities from texts (NO EMBEDDINGS!)
        # One session for the whole batch instead of one per chunk
        with self._driver.session() as session:
            for chunk_id, text in zip(chunk_ids, chunk_texts):
                entities = self._extract_entities(text)

                # Build knowledge graph
                session.run(
                    "CREATE (c:Chunk {id: $id, text: $text})",
                    id=chunk_id, text=text
                )
                for entity in entities:
                    # Variables do not carry over between queries, so the
                    # chunk must be matched again before linking it
                    session.run(
                        "MATCH (c:Chunk {id: $id}) "
                        "MERGE (e:Entity {name: $name}) "
                        "MERGE (c)-[:MENTIONS]->(e)",
                        id=chunk_id, name=entity
                    )

    def query_chunks(self, collection, query_text, n_results, filters):
        # Graph traversal (NO EMBEDDINGS!)
        entities = self._extract_entities(query_text)

        with self._driver.session() as session:
            result = session.run(
                """
                MATCH (c:Chunk)-[:MENTIONS]->(e:Entity)
                WHERE e.name IN $entities
                RETURN c.id, c.text, count(e) as relevance
                ORDER BY relevance DESC
                LIMIT $limit
                """,
                entities=entities, limit=n_results
            )
            return self._format_results(result)

Service Layer Changes

Before (Services Compute Embeddings)

# services/ingestion.py (OLD)
from services.embedding import EmbeddingService

def ingest_chunks(collection_name, ids, chunks, metadata):
    client = get_chroma_client()
    collection = client.get_collection(collection_name)

    # Service computes embeddings (TIGHT COUPLING!)
    embedding_config = get_collection_embedding_config(collection_name)
    embeddings = EmbeddingService.compute_embeddings(chunks, embedding_config)

    # Keyword arguments: positional order would pass chunks as embeddings
    collection.add(ids=ids, documents=chunks,
                   embeddings=embeddings, metadatas=metadata)

After (Technology-Agnostic)

# services/ingestion.py (NEW)
from database.connection import get_knowledge_store

def ingest_chunks(collection_name, chunk_ids, chunks, metadata):
    # Get collection from DB to find its setup
    db_collection = db.get_collection_by_name(collection_name)
    setup = db.get_knowledge_store_setup(db_collection.knowledge_store_setup_id)

    # Get appropriate plugin
    plugin = get_knowledge_store(setup.plugin_type)
    collection = plugin.get_collection(collection_name)

    # Just pass chunks + metadata (technology-agnostic!)
    plugin.add_chunks(collection, chunk_ids, chunks, metadata)

    # Plugin handles the rest:
    # - ChromaDB plugin: computes embeddings, stores vectors
    # - Neo4j plugin: extracts entities, builds graph
    # - ElasticSearch plugin: tokenizes, builds index

Key Changes:

  • ❌ Remove services/embedding.py entirely
  • ✅ Services never compute embeddings
  • ✅ Services work with chunks + metadata only
  • ✅ Same code works for any plugin!
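
To make the "same code works for any plugin" claim concrete, here is a small self-contained sketch: a hypothetical in-memory plugin that ranks chunks by naive keyword overlap, driven by an ingest function that only passes chunks and metadata. `InMemoryKeywordPlugin` and the simplified `ingest_chunks` signature (the plugin is passed in directly rather than looked up from a setup) are illustrative assumptions; such a class could also double as a test mock.

```python
from typing import Dict, List, Optional


class InMemoryKeywordPlugin:
    """Hypothetical test double: keyword overlap scoring, no embeddings."""

    name = "in-memory"
    supports_embeddings = False

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, dict]] = {}

    def get_collection(self, name: str) -> Dict[str, dict]:
        # A "collection" is just a dict of chunk_id -> chunk record.
        return self._collections.setdefault(name, {})

    def add_chunks(self, collection, chunk_ids: List[str],
                   chunk_texts: List[str],
                   chunk_metadata: Optional[List[Dict]] = None) -> None:
        chunk_metadata = chunk_metadata or [{} for _ in chunk_ids]
        for cid, text, meta in zip(chunk_ids, chunk_texts, chunk_metadata):
            collection[cid] = {"text": text, "metadata": meta}

    def query_chunks(self, collection, query_text: str,
                     n_results: int = 10,
                     filters: Optional[Dict] = None) -> Dict[str, List]:
        # `filters` is accepted for interface parity but ignored here.
        terms = set(query_text.lower().split())
        scored = []
        for cid, chunk in collection.items():
            score = len(terms & set(chunk["text"].lower().split()))
            if score:
                scored.append((score, cid))
        # Highest overlap first; ties broken by chunk id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:n_results]
        return {
            "ids": [cid for _, cid in top],
            "texts": [collection[cid]["text"] for _, cid in top],
        }


def ingest_chunks(plugin, collection_name, chunk_ids, chunks, metadata=None):
    """Service-layer code: no embedding, graph, or index logic."""
    collection = plugin.get_collection(collection_name)
    plugin.add_chunks(collection, chunk_ids, chunks, metadata)
    return collection


plugin = InMemoryKeywordPlugin()
papers = ingest_chunks(
    plugin, "papers", ["c1", "c2"],
    ["graph databases store relationships",
     "vector databases store embeddings"],
)
hits = plugin.query_chunks(papers, "graph relationships", n_results=1)
```

Swapping `InMemoryKeywordPlugin` for a ChromaDB or Neo4j plugin would leave `ingest_chunks` untouched, which is the property the refactoring is after.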

Migration Plan

Phase 1: Database Schema (Week 1)

  • Rename embeddings_setups → knowledge_store_setups
  • Add plugin_type column (default 'chromadb')
  • Add plugin_config JSON column
  • Migrate existing fields into plugin_config
  • Add knowledge_store_setup_id to collections

Phase 2: Plugin Infrastructure (Week 2)

  • Implement KnowledgeStorePlugin base class
  • Implement ChromaDBPlugin (reads from setups)
  • Implement KnowledgeStoreService
  • Update PluginRegistry

Phase 3: Service Layer Refactoring (Week 3)

  • Remove embedding computation from services
  • Replace get_chroma_client() with get_knowledge_store()
  • Delete services/embedding.py
  • Update ingestion, query, collections services

Phase 4: API & Router Updates (Week 4)

  • Add /knowledge-store-plugins endpoints
  • Update router to use plugin resolution
  • Configure KnowledgeStoreService in startup

Phase 5: LAMB Integration (Week 5)

  • Update LAMB endpoints: /embeddings-setups → /knowledge-store-setups
  • Update frontend labels
  • Show plugin type in setup list

Phase 6: Testing (Week 6)

  • Run KB-Server E2E tests (must pass 100%)
  • Run Playwright tests (must pass 100%)
  • Verify backward compatibility

Common Scenarios

Scenario 1: API Key Rotation (Same as #203)

Admin updates ChromaDB setup's plugin_config.api_key
→ All 42 collections using this setup immediately use new key
→ No collection-level updates needed!
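
The rotation behaviour follows from collections storing only a setup id and resolving the config on every call. A toy illustration, with a module-level dict standing in for the knowledge_store_setups table and hypothetical names (`SETUPS`, `resolve_setup`, `current_api_key`) throughout:

```python
from typing import Dict

# Hypothetical in-memory stand-in for the knowledge_store_setups table.
SETUPS: Dict[int, Dict] = {
    1: {
        "plugin_type": "chromadb",
        "plugin_config": {"vendor": "openai", "api_key": "sk-old"},
    },
}


def resolve_setup(setup_id: int) -> Dict:
    # Looked up on every call, never cached on the collection.
    return dict(SETUPS[setup_id]["plugin_config"])


class Collection:
    """Minimal collection record: it holds a setup id, not a config."""

    def __init__(self, name: str, setup_id: int) -> None:
        self.name = name
        self.metadata = {"knowledge_store_setup_id": setup_id}


def current_api_key(collection: Collection) -> str:
    setup_id = collection.metadata["knowledge_store_setup_id"]
    return resolve_setup(setup_id)["api_key"]


collections = [Collection(f"collection-{i}", setup_id=1) for i in range(3)]
keys_before = {current_api_key(c) for c in collections}

# Admin rotates the key on the setup; no collection is touched.
SETUPS[1]["plugin_config"]["api_key"] = "sk-new"
keys_after = {current_api_key(c) for c in collections}
```

The trade-off is one setup lookup per operation; if that becomes a bottleneck, a short-lived cache keyed by setup id would keep rotation near-immediate while cutting database reads.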

Scenario 2: Adding New Backend (NEW)

Admin creates new Neo4j setup:
  plugin_type: "neo4j"
  plugin_config: {uri, user, password, schema}

→ Users can now choose Neo4j when creating collections
→ Existing collections unaffected

Scenario 3: Choosing Backend by Use Case (NEW)

Use Case                      → Best Plugin
─────────────────────────────────────────────
Document Q&A, semantic search → ChromaDB/Qdrant (vector)
Knowledge graph, relationships → Neo4j (graph)
Exact phrase matching, legal  → ElasticSearch (keyword)
Hybrid: semantic + keyword    → Hybrid plugin

Benefits Summary

Technical Benefits

  • Technology-agnostic - Same API for vector, graph, keyword, hybrid
  • Plugin autonomy - Each plugin decides how to process chunks
  • Simpler services - No embedding knowledge, cleaner code
  • Better testing - Mock plugins easily
  • Future-proof - New retrieval tech added without touching services

Business Benefits

  • Flexibility - Organizations choose best tech for their needs
  • Risk reduction - Not locked to single database vendor
  • Innovation - Experiment with new retrieval approaches
  • Cost optimization - Use cost-effective backends per use case

Comparison to Current State

Aspect                  → Before (Issue #203) → After (Plugin Architecture)
──────────────────────────────────────────────────────────────────────────────
Setup naming            → embeddings_setups   → knowledge_store_setups
Config structure        → Individual fields   → Generic plugin_config JSON
Backend support         → ChromaDB only       → ChromaDB, Neo4j, ElasticSearch, etc.
Plugin selection        → N/A (implicit)      → Via plugin_type field
Service layer           → Knows embeddings    → Technology-agnostic
Embedding computation   → Service layer       → Plugin-internal
Adding backend          → Refactor 8+ files   → Create plugin, register
API key rotation        → ✅ Works            → ✅ Works
Per-collection backend  → ❌ All ChromaDB     → ✅ Each chooses

Validation & Testing

Backward Compatibility Requirements

  • All existing collections continue working
  • Can ingest to existing collections
  • Can query existing collections
  • API key rotation still works
  • All E2E tests pass without modification
  • All Playwright tests pass without modification

New Features to Verify

  • Can list available plugin types
  • Can validate plugin configs
  • ChromaDB plugin works identically to pre-plugin architecture
  • Organizations can create setups with different plugin types
  • Collections can be created with any active setup

Open Questions

  1. Plugin Config Migration: Keep old individual fields for backward compatibility or require immediate migration to plugin_config?

  2. Cross-Plugin Migration: Support migrating collections between plugin types (e.g., ChromaDB → Qdrant)?

  3. Plugin Versioning: Should setups include plugin version info?

  4. System-Wide Setups: Should there be system setups available to all orgs?

  5. Plugin Discovery: Auto-discover from plugins directory or explicit registration?


Success Metrics

After implementation:

  • API key rotation takes < 1 minute
  • Can add plugin types without modifying service code
  • Organizations can offer multiple backend choices
  • ChromaDB plugin identical to pre-plugin architecture
  • All existing tests pass
  • Setup configs never expose credentials
  • Collection creation shows available backends

Next Steps

  1. Review - Get team approval on architecture
  2. Phase 1 - Database schema migration
  3. Phase 2 - Plugin infrastructure
  4. Phase 3 - Service refactoring
  5. Phase 4 - API updates
  6. Phase 5 - LAMB integration
  7. Phase 6 - Testing & validation
  8. Deploy - Gradual rollout with monitoring

Prerequisites: Issue #203 (KnowledgeStore Setups) must be implemented first.

Status: Draft for Review
